Wikipedia Diversity Observatory
The Wikipedia Diversity Observatory (WDO) is a space to study Wikipedia's diversity coverage, discuss the strategic needs, and propose solutions to improve it.
To do so, it aims at raising awareness on Wikipedia’s current state of diversity by providing datasets, visualizations, and statistics, as well as pointing out solutions and tools.
This project’s vision is to align the movement to achieve diversity in the content of the different projects.
This project's mission is to create a joint space for researchers and activists to study and fight against the cultural knowledge gaps and promote knowledge equity. Hence, we provide strategic valuable data and resources to organize and take action.
These are the three main outcome goals we are working on to increase the diversity within the Wikimedia projects:
Main outcome goals:
- Every Wikipedia language edition ensures a minimal representation of their own territories’ cultural context (from geography to biographies, traditions, language, and others).
- Every Wikipedia language edition ensures minimal coverage of every other language's cultural context content.
- Every Wikipedian has information about marginalized languages without a Wikipedia so they can help their speakers create one and start representing their cultural context.
In order to reach these goals, we detail some other more specific goals in community engagement and research and development activities of the project.
Community engagement goals:
- Every Wikipedia language community is aware of the knowledge inequalities across the entire Wikipedia project.
- Every Wikipedia language community is aware of the importance of representing its own culture so that users of the other language editions can import and learn from it.
- Every Wikipedia event and community-organized contest considers dedicating sections and activities to mitigating the cultural knowledge gaps and the inequalities derived from them.
Research and development goals:
- Every Wikipedian has access to some data visualization tools in order to browse the gaps and create new valuable articles.
- Every Wikipedian has access to some statistical analysis on the extent of the gaps and understands the priorities in order to bridge or cover them.
- Every Wikipedian has access to some data on the world's languages without a Wikipedia in order to disseminate the importance and try to engage in creating one.
Framing the problem of Cultural Diversity
“The sum of human knowledge” is not found in a single language but in the existing cultural diversity of every territory and language in the world. We have to work on very different aspects and align all the Wikimedia movement stakeholders to facilitate the creation of articles that reflect this cultural diversity.
We see this as two sequential processes: representation and sharing. For each language, representation implies creating content that relates to the editors' geographical and cultural context. Sharing, in turn, implies understanding where the gaps are, both in one's own language and in the others, in order to exchange each other's cultural context content and increase the cultural diversity of all languages.
In order to facilitate cultural context representation, we propose:
- Create, collect, process, and present different sorts of metrics and tools to describe the creation and usage of cultural content on Wikimedia projects.
- Understand the situation of all the world's languages that could become Wikipedia language editions, and consequently, the potential content about their cultural context they would bring to the entire Wikipedia project.
In order to help each language share (import and export) the cultural context content of all languages, we propose:
- Ideate and develop tools that prioritize and allow finding the most valuable content (popular and relevant) that might be essential to be created across projects.
- Provide training to organizations and individuals in these tools so that they can help mitigate the knowledge gaps and increase the cultural diversity in Wikimedia projects.
Not all languages are in the same position to achieve good coverage of the world's cultural diversity. Usually, languages represent their own cultural context first and only later build the capacity and maturity to create articles about every other language's cultural context. It is possible to compare and discuss the maturity level of a language edition, in terms of the cultural diversity of its content, according to several aspects we discuss in this preliminary model.
You can learn more about how to improve cultural representation and sharing in these guidelines.
Cultural diversity tools
As an observatory, this project focuses on bridging the gap between research and activism rather than on content creation itself. This portal itself provides results. Most of the visualizations are located or better depicted on an external website (wcdo.wmflabs.org), created with Plotly and hosted on Toolforge.
Even though some results are repeated on both sites, those on the external website are preferable as they allow better user interaction with the data. For example, the tables in the external version of List of Wikipedias by Cultural Context Content offer a filtering feature that is not available in the version of the same list on this site.
This project is continually developing research questions, concepts, dashboards, visualizations, and tools.
WCDO's main concepts are Cultural Context Content, Culture Gap, Top CCC Diversity Lists, and Missing CCC articles:
Cultural Context Content (CCC) aka Local Content
Cultural Context Content (CCC) (methodology) is the group of articles in a Wikipedia language edition that relates to the editors' geographical and cultural context (places, traditions, language, politics, agriculture, biographies, events, etc.) (Figure 1). You can watch this YouTube video explaining its creation and use.
In order to create any CCC it is necessary to establish a language-territories mapping, in other words, to pinpoint the territories where the language is spoken natively or has official legal status.
Cultural Context Content is collected as a group of datasets (Figure 2), which are released on a monthly basis. These datasets are used to compute and depict several statistics on the state of knowledge equality and cross-cultural coverage.
For example, it is possible to consult the extent of CCC in each Wikipedia language edition (List of Wikipedias by Cultural Context Content) or even the number of articles from a particular territory in one language edition's CCC (List of Language Territories by Cultural Context Content).
The culture gap occurs when a Wikipedia language edition does not cover articles that belong to another language edition's CCC. Around 50% of the articles that do not exist across language editions (the language gap) are due to the culture gap.
In order to compute the culture gap and other statistics, WCDO proposes calculating the intersections between different sets of articles (e.g. the articles common to all articles from the English language edition and the articles from the Japanese CCC). The use of intersections allows seeing both the absolute number of shared articles and their extent (the relative importance) in each of the two sets.
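The intersection idea described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: it assumes that articles are matched across language editions through a shared identifier (here, made-up Wikidata Qitem strings), and the function name `intersection_stats` is hypothetical.

```python
def intersection_stats(set_a, set_b):
    """Return the number of shared articles and their share (in percent) of each set."""
    common = set_a & set_b
    share_a = 100 * len(common) / len(set_a) if set_a else 0.0
    share_b = 100 * len(common) / len(set_b) if set_b else 0.0
    return len(common), share_a, share_b


# Hypothetical identifiers, used only for illustration.
english_articles = {"Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8"}
japanese_ccc = {"Q3", "Q4", "Q9", "Q10"}

count, share_of_english, share_of_japanese_ccc = intersection_stats(
    english_articles, japanese_ccc
)
print(count, share_of_english, share_of_japanese_ccc)  # 2 25.0 50.0
```

In this toy example the same two shared articles represent 25% of the English set but 50% of the Japanese CCC, which is why the text reports the extent relative to each of the two sets rather than a single figure.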
In these two tables, it is possible to see the culture gap in two different ways. First, the spread of a language CCC on the rest of Wikipedia language editions, and, second, the coverage of all the languages CCC.
Top CCC Diversity lists
Wikipedia language editions should not be a replica of each other and the gap may never be completely closed. However, minimal coverage of all other languages should be a goal on the agenda of each Wikipedia edition to create more multicultural (and complete) encyclopaedias.
Top CCC articles lists can help in providing content for this minimal cultural coverage. Inspired by the Vital articles lists, the Top CCC articles present the most relevant articles in terms of different metrics (e.g. the number of editors or pageviews) and specific content types (e.g. geolocated articles or women) from a language's cultural context or a country's cultural context.
The currently generated Top CCC articles lists are:
- list of CCC articles with the highest number of editors (Editors),
- list of CCC articles with featured article distinction (Featured), most bytes and references (weights: 0.8, 0.1 and 0.1 respectively),
- list of CCC articles with geolocation with most links coming from CCC,
- list of CCC articles with keywords in the title with most bytes (Bytes),
- list of CCC articles categorized in Wikidata as women with most edits (Women),
- list of CCC articles categorized in Wikidata as men with most edits (Men),
- list of CCC articles created during the first three years and with most edits (First 3Y.),
- list of CCC articles created during the last year and with most edits (Last Y.),
- list of CCC articles with most pageviews during the last month (Pageviews),
- list of CCC articles with most edits in talk pages (Discussions).
On this page, you can consult the list from a particular country or language CCC generated on a monthly basis from the latest CCC dataset. You need to specify the list parameter (editors, featured, geolocated, keywords, women, men, created_first_three_years, created_last_year, pageviews, and discussions), the language target parameter (as target_lang and the language wikicode), the language origin (as source_lang and the language wikicode), and, optionally to limit the scope of the selection, the country origin parameter as part of the CCC (as source_country and the country ISO3166 code). In case no country is selected, the default is 'all'.
One possible URL with Top CCC list by number of editors, language origin Spanish, language target Italian and no country would be: https://wcdo.wmflabs.org/top_ccc_articles/?list=editors&source_lang=es&target_lang=it
A similar list but limited to a specific country and to women would be:
The generated table includes several metrics and shows the availability in the top right column with the current title (in case it exists) or one possible title generated by the Content Translation tool or by a Wikidata label.
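For readers who prefer to build these query URLs programmatically, the following is a minimal sketch using only the parameters named above (`list`, `source_lang`, `target_lang`, and the optional `source_country`). The helper name `top_ccc_url` is hypothetical, and whether the service expects the country code in upper or lower case is not stated here, so the value is passed through unchanged.

```python
from urllib.parse import urlencode

BASE_URL = "https://wcdo.wmflabs.org/top_ccc_articles/"

# List names as enumerated in the text above.
VALID_LISTS = {
    "editors", "featured", "geolocated", "keywords", "women", "men",
    "created_first_three_years", "created_last_year", "pageviews", "discussions",
}


def top_ccc_url(list_name, source_lang, target_lang, source_country=None):
    """Build a Top CCC articles query URL from the documented parameters."""
    if list_name not in VALID_LISTS:
        raise ValueError(f"unknown list: {list_name}")
    params = {
        "list": list_name,
        "source_lang": source_lang,
        "target_lang": target_lang,
    }
    # The country is optional; when omitted, the text says the default is 'all'.
    if source_country is not None:
        params["source_country"] = source_country
    return BASE_URL + "?" + urlencode(params)


print(top_ccc_url("editors", "es", "it"))
# https://wcdo.wmflabs.org/top_ccc_articles/?list=editors&source_lang=es&target_lang=it
```

Called without a country, the helper reproduces the Spanish-to-Italian example URL given above; passing a fourth argument appends `source_country` to the query string.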
Another way to browse the lists is by examining how well a language edition covers the other language editions' Top CCC articles lists (centered around countries, as Countries Top CCC article lists), or how well one particular language edition's Top CCC lists are spread across the rest of the language editions.
In this case, it is necessary to specify the language covering or spreading the lists with the lang parameter. This is an example using Catalan Wikipedia:
- Languages Top CCC articles spread from Catalan Wikipedia.
- Languages Top CCC articles coverage by Catalan Wikipedia.
- Countries Top CCC articles coverage by Catalan Wikipedia.
Missing CCC articles
Normally, Wikipedia language editions tend to cover their own cultural context (from territories to all the cultural expressions) much better than others. However, in around 150 languages the cultural context content is below 10% of the content, which is a sign that it is likely underrepresented. In this case, it is very possible that larger Wikipedia language editions have articles that are missing in their CCC. Sometimes these languages are English, French, Russian and Spanish, which are the languages that usually coexist with other languages with a Wikipedia (only 48 Wikipedia language editions are of languages that do not coexist with other languages in one territory).
In order to improve the representation of local content in these underdeveloped Wikipedias, we proposed the creation of a tool named "Missing CCC articles". This allows us to query articles that should exist in one language's CCC but have not been created yet and exist instead in other languages. Additionally, we can also query articles from a language's CCC that are longer in another language edition.
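Conceptually, the query described above is a set difference. The sketch below is not the tool's implementation; it assumes articles are identified by a language-independent identifier (made-up Qitem strings here) and that the items tied to the target language's cultural context are already known. All names and data are hypothetical.

```python
def missing_ccc_articles(target_edition, other_editions, target_context_items):
    """Items from the target language's context that exist elsewhere but not in the target edition."""
    exists_elsewhere = set().union(*other_editions.values())
    return (target_context_items & exists_elsewhere) - target_edition


# Hypothetical data for illustration only.
luganda_edition = {"Q10", "Q11"}
other_editions = {
    "en": {"Q10", "Q12", "Q13"},
    "fr": {"Q12", "Q14"},
}
luganda_context_items = {"Q10", "Q11", "Q12", "Q15"}

missing = missing_ccc_articles(luganda_edition, other_editions, luganda_context_items)
print(missing)  # {'Q12'}
```

Here `Q12` belongs to the target context and exists in English and French but not in the target edition, so it is the only candidate returned; `Q15` is excluded because no other edition has it either.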
It is possible to query any list by changing the URL parameters or by using the following menus. You first need to select the target language (where you would like to improve local content representation). Additionally, if you want to aim at a specific part of a language context, you can select the target country and target region - they are optional and allow you to filter for a specific area. For instance, for Target language French, whose language context encompasses several countries, Target country and Target region could be France and Québec.
One possible URL with a query for Luganda CCC about Uganda and Geolocated content that is found in any other language edition would be:
Disclaimer: This tool is still at the Alpha phase and may contain some bugs. Your feedback can be useful.
Common CCC articles
Cultural Context Content is a selection of articles that relate to language-related territories, their people and customs. Even though the selection provides a defined group of articles, we must acknowledge that some articles may belong to more than one cultural context. This can be, among other cases, a celebrity who was born in one country but spent most of her career in another, or a historical battle in which two or more armies intervened. For this reason, we can say that the cultural context is a continuum.
In order to find the articles that are found in-between two cultural contexts and may belong to more than one selection, we have created a tool named "Common CCC articles". This allows us to search for articles common to two language editions’ cultural context content and the gaps in other language editions.
It is possible to query articles by adding source languages (the first one sets the reference CCC, and the other languages will be used to filter the resulting list of articles depending on how related they are to it). The Target Languages parameter allows you to select a list of languages in which you want to check whether the resulting articles exist. You can also filter the results to show only the gaps in the target languages.
One possible URL with a query for Ukrainian and Polish cultural context content and its availability in English and Russian would be:
Disclaimer: This tool is still at the Alpha phase and may contain some bugs or give undesired results. Your feedback can be useful.
Visual CCC articles
Usually Cultural Context Content is much more developed than the rest of the articles, in terms of the number of references, length, and the number of images. However, in some cases, the version of the article in another language contains valuable images. This tool allows you to check which images are most used for every article belonging to the Top CCC Diversity Lists or any list you want to paste. You can query a list of articles and their images, and you can also choose the number of images you want to see or show only those that are missing in the original article. For example, if we want to see images (only 4) of the Catalan Top CCC list "Women", and only those missing, we would use the following URL:
Incomplete CCC articles
Usually Cultural Context Content is much more developed than the rest of the articles, in terms of the number of references, length, and the number of images. As said before, in some cases articles are more complete in other languages. On this page, you can check whether the articles of a language edition, either introduced manually or taken from a Top CCC list, are more complete in other language editions. In other words, you can compare each article's stats (number of bytes, number of references, number of images, number of outlinks, among others) with those in other languages, and then decide whether to expand these articles or not. You can also compare engagement characteristics (e.g. number of editors, number of edits or number of pageviews) or the 'featured article' distinction.
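The comparison described above can be pictured with a small sketch. It is not the tool's code: the per-language stats dictionaries, the metric names, and all numbers below are hypothetical and only show how one article's figures might be compared across editions.

```python
def more_complete_elsewhere(source_stats, other_stats, metric="bytes"):
    """Return the languages whose version of an article exceeds the source version on one metric."""
    baseline = source_stats.get(metric, 0)
    return sorted(
        lang for lang, stats in other_stats.items()
        if stats.get(metric, 0) > baseline
    )


# Hypothetical stats for a single article, for illustration only.
czech_version = {"bytes": 4200, "references": 3, "images": 1}
other_versions = {
    "en": {"bytes": 15800, "references": 22, "images": 4},
    "de": {"bytes": 3900, "references": 5, "images": 1},
}

print(more_complete_elsewhere(czech_version, other_versions, "bytes"))       # ['en']
print(more_complete_elsewhere(czech_version, other_versions, "references"))  # ['de', 'en']
```

Comparing one metric at a time keeps the decision in the editor's hands: an article may be shorter in another language yet better referenced, as with the German version in this toy example.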
For example, if we want to see the articles from the Czech Top CCC list "Geolocated" that are more complete in other languages we would use the following URL:
Search CCC articles
On this page, you can search for articles in a Wikipedia language edition and see their availability in other language editions. First, you need to select the Source Language where you want to retrieve the content from. Then you can choose the Type of query: List of articles, List of categories articles, Wikidata SPARQL Query, and Wikipedia Content Search.
The List of articles query simply allows you to introduce a list of articles (their titles or their URLs separated by a comma, semicolon or a line break) in the textbox in order to see the main stats and their availability in the Target Languages. The List of categories' articles allows you to introduce a list of categories and retrieve the articles contained in them. The Wikidata SPARQL Query allows you to introduce a query in the textbox and retrieve the articles related to the Qitems that appear in them (if the query does not contain any Qitem and only labels, there will be no results). The Wikipedia Content Search allows you to run a query through Wikipedia's own search engine (CirrusSearch). For example, if you introduce the Source Language Japanese and the query "Japanese Cuisine", you will obtain the articles from Japanese Wikipedia along with their main stats on relevance features (number of editors, edits, discussion edits, pageviews, etc.). When using the search option, you can introduce the Language of the query to specify which language you are using to query (e.g. Japanese cuisine could be "cuisine du Japon" in French), regardless of whether it is the same as the target language.
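As an illustration of the input format accepted by the List of articles query (titles or URLs separated by a comma, semicolon or line break), here is a minimal parsing sketch. The function name is hypothetical and the tool's real parsing may differ; note that this simple approach would also split a title that itself contains a comma or semicolon.

```python
import re


def parse_article_list(raw_text):
    """Split a pasted list on commas, semicolons or line breaks and drop empty entries."""
    parts = re.split(r"[,;\n]", raw_text)
    return [part.strip() for part in parts if part.strip()]


pasted = "Sushi, Ramen;Tempura\nhttps://ja.wikipedia.org/wiki/Miso\n"
print(parse_article_list(pasted))
# ['Sushi', 'Ramen', 'Tempura', 'https://ja.wikipedia.org/wiki/Miso']
```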
For example, if we want to see articles on Japanese Cuisine in Japanese Wikipedia and their availability in Catalan, Spanish, French, and English Wikipedia, we would use the following URL:
More tools (work in progress)
Currently, we want to use the CCC datasets to monitor the gaps on a continual basis (showing the creation of articles for specific kinds of content to show whether and where editors are really bridging the gap) along with many other lists, solutions, and improvements after all the feedback gathered in past Wikimedia events and from local communities (Figure 4). Likewise, we want to create a multilingual editors dashboard in which to find potential collaborators. The editor must be able to query lists or visualizations to see editors from other language editions according to their cultural context interests.
Other diversity tools and research papers
We also want to provide a short overview on the different other tools and research papers created outside this project that are useful to understand and detect cultural differences between language editions and possibly bridge the gaps or work on other diversity problems like the content gender gap.
List of dashboards with tools and visualizations
This is a list of the different dashboards created to visualize the gaps and tools to provide points of action to work on them. They are not limited to cultural diversity but include other kinds of diversity based on geography or gender.
- Cultural Context Content (CCC)
- Cultural Gap (CCC Coverage and Spread)
- Geography Gap
- Gender Gap
- Topical Coverage
- Last Month Pageviews
- Diversity Over Time
- Languages Top CCC Articles Coverage
- Countries Top CCC Articles Coverage
- Languages Top CCC Articles Spread
These are the latest actions we have taken to raise awareness of the cultural diversity problem in Wikipedia. They consist of the dissemination of research results, concepts, and tools:
- 06/10/2019 | WikiArabia | Talk: The State of Cultural Diversity in Arabic Wikipedia: Insights and Challenges.
- 17/08/2019 | Wikimania | Poster: Wikipedia Cultural Diversity Dataset: helping editors to enrich cross-language coverage. This poster explained the dataset.
- 17/08/2019 | Wikimania | Poster: Maturity Levels for Cultural Diversity in Wikipedia Language Communities. This poster explained the different levels.
- 18/08/2019 | Wikimania | Diversity Talk: Wikipedia Cultural Diversity Observatory (WCDO): Empowering Communities to Bridge the Culture Content Gaps. This presentation explained the current state of the project with its new Missing CCC lists and also alerted of the lack of impact of Wikimania 2018 to bridge the African content gap (pdf slides and video).
- 18/08/2019 | Wikimania | Language Talk: Minoritized Languages and Missing Languages in Wikipedia: An Opportunity to Increase Cultural Diversity in Wikipedia. This presentation explained that to make Wikipedia more culturally diverse we need more languages (proposed a method to select them) and help minoritized languages to create their content (suggested a method to propose new articles) (pdf slides).
- 18/08/2019 | Wikimania | Readership Talk: Increasing Wikipedia Readership By Creating Local Content In Language Editions. This presentation explained that local content is vital in order to increase a language edition readership and gave some numerical reasons (pdf slides).
- 16/08/2019 | Wikimania | Research Talk: Cultural Diversity Funnels: A Metaphor To Study Wikipedia Communities and Knowledge Gaps. This presentation explained that there exist different barriers that stop cultural diversity representation and proposed the metaphor of a funnel in order to depict it.
- 05/07/2019 | Celtic Knot | Language Talk: Languages Matter to Cultural Diversity: Finding Missing Languages and Bridging the Gaps in Minority Languages.
- 12/06/2019 | ICWSM Conference | Academic Paper/Presentation: Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions. Proceedings of the 13th International AAAI Conference on Web and Social Media. ICWSM, Munich June 11-13th (ICWSM. ACM.)
- 16/05/2019 | Chapter for the book “Wikipedia@20” | “The Sum of Human Knowledge? Not in One Wikipedia Language Edition”.
This project also aims at raising debates on the different types of diversity. Some of the Wikimedia 2030 Strategy process discussions in the Diversity Working group are directed at improving diversity on content and in the current communities.
These are some of the recommendations that are related to the project:
- Content Diversity Metrics and Guidelines
- Parameterized User Pages for Encouraging and Measuring Community Diversity
- Identifying the Wikimedia Editing and Community Diversity Barriers in Each Country and Introduce Them in Wikidata
You can always contact us and engage in discussions. We believe the Wikimedia movement needs more discussions on diversity in order to encourage the necessary changes to become more inclusive and improve content coverage.
Activities / Get involved
The Observatory does need dissemination in order to reach all the possible Wikimedia events and activities where it could provide some value. If you want to collaborate, get involved. Leave your username and send us an e-mail at firstname.lastname@example.org.
The project welcomes all kinds of contributors who want to participate in discussions, do some research on diversity, or simply fight the gaps. There are at least seven different types of activities or profiles from which the project would benefit.
- Data retriever frames the problem of diversity and extracts the necessary data to study and throw light on it.
- Researcher/data analyst studies the data on a problem of diversity, extracts conclusions, and communicates them through visualizations.
- Communicator explains the conclusions in order to raise awareness in specific communities or movement-wide.
- Strategist proposes some top-priority goals, mechanisms, or principles based on research and the state of the communities.
- Creator proposes or creates a tool based on the data or any conclusion in order to improve any diversity-related problem.
- Developer/designer works on developing and refining the tools in order to make them as usable as possible.
- Program manager organizes programs with the communities including activities and using the tools in order to solve the problems.
The observatory is primarily involved in the earlier activities on this list, those centered on data, research, and communication. However, any of the seven can make a contribution and benefit from the work done.
(!) It is important to warn that most of the outcomes of this project are currently at a "prototype" stage, as most of the efforts are dedicated to data, research, and communication. In other words, they work and give the expected results, but require some design and development improvements in order to provide a better experience (in terms of speed and extra functions). If you are a professional developer, join and improve them!
Also, it is important to say that some work on diversity that originated in other spaces (or projects in the Wikimedia sphere) can be disseminated here or trigger new tools. Getting involved can be useful in order to find a meeting point or a place to start working on diversity.
In case you want to code some extra visualizations, you can find the project's code here: GitHub page.