Research:Social Memory about Chileans in Wikipedia

21:14, 4 April 2024 (UTC)
Carlos Cruz Infante
Duration:  2023-6 – 2024-6
Wikidata, social memory, digital discourse, knowledge gaps

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

This project investigates Chilean biographies on Wikipedia, examining how content gaps evolve across generations of notable people. Our goal is to explore these gaps in various occupational domains (e.g., politics, science, art, and sport) and to use these trends to understand how Wikipedia frames (that is, highlights aspects of) the history of notable Chileans. We also aim to document our methods so that the study can be replicated in other countries.

Research objectives

1. Describe how the social memory about notable Chileans is structured in Wikipedia, considering both the selection of biographies and the text that appears in those articles.

2. Explore knowledge gaps about categories of people, focusing on distributions according to gender, place of birth, and occupation.

3. Analyze biographical content according to the period of birth of notable persons to understand how Wikipedia portrays the history of notable Chileans.

4. Disseminate and document methods that allow these analyses to be quickly replicated in other countries.

Related work

Knowledge gaps in biographies

Many studies have shown that biographies on Wikipedia tend to make specific categories of people more visible. For example, they focus disproportionately on men (Hinnosaar, 2019) born in the Global North (Beytía, 2020), who lived in the last century (Samoilenko et al., 2017), and who excelled in professions such as mass arts or popular sports (Reznik & Shatalov, 2016).

In previous research, we have investigated biographical content asymmetries related to gender and place of birth:

  • In a study based on the Networked Pantheon dataset (Beytía & Schobin, 2020), we found that only five countries in the Global North concentrate more than 62% of Wikipedia's biographical coverage. In addition, we estimated that the inequality in coverage between countries reaches a Gini coefficient of 0.84 (Beytía, 2020).
  • We examined the written and visual asymmetries between men's and women's biographies in the ten most widely spoken languages (Beytía et al., 2022). We concluded that (a) most of the male bias arises when selecting who will have a biography, (b) written and visual asymmetries do not follow the same patterns of disparity, (c) men's biographies tend to have more images across languages, and (d) women's biographies have, on average, better visual quality.
  • In another study, we proposed a general theoretical framework to observe content asymmetries in Wikipedia closely, which was tested with research findings on gender gaps (Beytía & Wagner, 2022). Our "Visibility Layers" model analyzes content inequalities across three editorial stages (content selection, building, and positioning) that contribute to making groups of biographies more or less visible.
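As an illustration of the inequality measure mentioned in the first study above, the following is a minimal sketch of how a Gini coefficient can be computed from per-country biography counts. The function name `gini` and the toy counts are hypothetical and are not taken from the cited study; the sketch uses the standard rank-based formula for sorted values and is not necessarily the exact procedure followed in Beytía (2020).

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-based formula over ascending values x_1..x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical biography counts per country (illustrative numbers only).
toy_counts = [120, 80, 15, 10, 5, 3, 2, 1]
print(round(gini(toy_counts), 2))
```

With real data, `values` would hold the number of biographies attributed to each country; a value close to 1 indicates that a few countries concentrate most of the coverage.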

Exploring historical trends

The literature reviewed above examines the knowledge asymmetries accumulated in Wikipedia biographies, considering the lives of people from all eras. However, it is important to differentiate the life periods of notable Chileans. Doing so allows us to understand whether knowledge asymmetries are more or less pronounced among notable people of certain periods and to analyze which historical discourse this record of biographies implicitly promotes.

A line of research has analyzed temporal evolution in biographies by looking at specific variables. For instance, biographies have been examined to observe the evolution of occupations across generations (Jara-Figueroa et al., 2019), variation in migration patterns (Menini et al., 2017), and changes in biographical ties (Schich et al., 2014). Usually, these studies have analyzed content-specific variables (occupational distribution, migration, biographical relationships) and not the joint evolution of multiple content asymmetries across generations. Therefore, they do not offer a comprehensive look at how biographies frame the history of any specific human group.

In a recent study, we attempted to develop a more comprehensive approach, examining how Wikipedia frames the history of sociology in multiple and interrelated information structures (Beytía & Müller, 2022). There, we studied content structures and asymmetries across all generations of notable sociologists. For instance, we analyzed trends in gender composition, geographical centers, discursive centrality of notable people, and clusters of biographical connections (Beytía & Müller, 2022).

Through this research, we aim to replicate the “Wikipedia-framing” approach, albeit with a different subject matter. Instead of studying the history of a specific scientific discipline (sociology), we investigate how Wikipedia frames the history of notable people in a specific country and across different occupations. To our knowledge, this topic has not been investigated so far, especially through a joint analysis of multiple content asymmetries that vary across generations.

Methods

We will focus on specific occupational categories that represent socially diverse areas of notable people: politicians, scientists, artists, and sportspersons. Previous research has shown that, in the last generations of notable people (those born since the 1950s), those occupational dimensions are the ones with the highest number of biographies on Wikipedia (Reznik & Shatalov, 2016; Yu et al., 2016). Except for sportspersons, those same categories are also the dominant occupations in the generations born since the beginning of Chile's history (early 19th century).

We will analyze the biographies of those occupational dimensions using the following staged methodology:

1. Data extraction: we will create databases of people with Chilean nationality in the four occupational domains using Wikidata. From that source, we will extract essential biographical information (name, gender, year of birth, birthplace, year of death, death place, a portrait, and biography hyperlink). We will also obtain information on the notables’ participation in sub-domains (i.e., political parties, scientific disciplines, sports branches, and artistic disciplines). As an approximation to the communicative influence of each notable, we will get the number of available languages for each biography.

2. Preprocessing: we will manually check the data for correctness. We will also add any relevant biography that is available on Wikipedia but missing from the record.

3. Natural language processing (NLP): we will perform NLP on the Spanish biographies to recognize relevant discursive entities (people, events, places, dates, and organizations). Starting from the hyperlinks extracted from Wikidata, we will retrieve the biographical text and use the Python library spaCy to perform named-entity recognition (NER). We will do this processing separately for each occupational domain. Finally, we will review the lists of recognized entities to check for repeated terms, incomplete names, or nonsense phrases.

4. Research products: we will prepare (1) an open database of notable Chileans in Wikipedia, (2) an open database of the main entities that we found in the NLP, and (3) scripts and documentation on how to extract this data from Wikidata and perform the named-entity recognition process with Wikipedia biographies. All this material will be released under CC BY-SA 4.0 or a more permissive license.
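The data extraction and NLP stages above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual scripts (those are linked in the Results section). The helper names (`build_query`, `run_query`, `rows_from_bindings`, `extract_entities`, `clean_entities`), the choice of Wikidata properties, the example occupation item Q82955 (politician), and the cleaning thresholds are all assumptions made for the example. The sitelink count is used here only as a rough proxy for the number of language versions, since it counts links to all Wikimedia projects.

```python
import json
import re
import urllib.parse
import urllib.request
from collections import Counter

WDQS_URL = "https://query.wikidata.org/sparql"  # public Wikidata Query Service

# Assumed mapping: P31/Q5 = human, P27/Q298 = citizenship Chile, P106 = occupation,
# P279 = subclass of, P21 = gender, P569 = date of birth.
QUERY_TEMPLATE = """
SELECT ?person ?personLabel ?genderLabel ?birth ?sitelinks ?article WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P27 wd:Q298 ;
          wdt:P106/wdt:P279* wd:OCCUPATION ;
          wikibase:sitelinks ?sitelinks .
  OPTIONAL { ?person wdt:P21 ?gender . }
  OPTIONAL { ?person wdt:P569 ?birth . }
  OPTIONAL { ?article schema:about ?person ;
                      schema:isPartOf <https://es.wikipedia.org/> . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "es,en" . }
}
"""


def build_query(occupation_qid):
    """Fill the occupation item (e.g. 'Q82955') into the query template."""
    return QUERY_TEMPLATE.replace("OCCUPATION", occupation_qid)


def run_query(query):
    """Send the query to the endpoint (network access needed; not called here)."""
    url = WDQS_URL + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "notable-chileans-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def rows_from_bindings(result):
    """Flatten the SPARQL JSON result into a list of plain dictionaries."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


def extract_entities(nlp, text):
    """Run an NER pipeline (e.g. a spaCy model) and return (text, label) pairs."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def clean_entities(entities, min_len=3):
    """Merge repeated terms and drop very short or non-alphabetic strings."""
    counts = Counter()
    for text, label in entities:
        name = re.sub(r"\s+", " ", text).strip(" .,;:()\"'")
        if len(name) < min_len or not any(ch.isalpha() for ch in name):
            continue
        counts[(name, label)] += 1
    return counts


# Usage with spaCy (requires `pip install spacy` and
# `python -m spacy download es_core_news_sm`):
#   import spacy
#   nlp = spacy.load("es_core_news_sm")
#   rows = rows_from_bindings(run_query(build_query("Q82955")))
#   counts = clean_entities(extract_entities(nlp, biography_text))
```

Note that spaCy's pretrained Spanish pipelines label entities as PER, LOC, ORG, and MISC, so recognizing dates and events as separate categories would likely need additional rules or a different model. The automatic cleaning shown here only complements the manual review described in stage 3.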

Policy, Ethics and Human Subjects Research

We do not see ethical concerns with this research topic and methods. Also, we will not disrupt Wikipedians' work in our research stages.

Results

This project includes the following outputs:

1. An open database with biographies of Chileans and their key characteristics.

Download the most recent version here (16.04.2024).

2. An open database with entity recognition analysis of those biographies.

Download the most recent version here (16.04.2024).

3. Open-access documentation to extract the data and create similar research projects (e.g. to analyze Wikipedia's social memory on other countries).

  • Creation of biographies database:

- Documentation (pdf file) - Google Colab Notebook

  • Named Entity Recognition (NER) in biographies:

- Documentation (pdf file) - Google Colab Notebook

4. A report of results submitted to the Wiki Workshop 2024 and Wikimedia Chile.

We submitted the following abstract:

Beytía, Rojas & Cruz (2024). Social memory about people from a country. The case of notable Chileans in Wikipedia.

After finishing this project, we expect to publish our analyses in an academic journal and create interactive tools that allow Wikipedians to explore knowledge gaps easily.

Resources

Here, you can see presentations, interactive charts, and other materials to disseminate our work.

1. Explorer of notable Chileans' occupations: an interactive chart to review our occupational classification.

2. Categories of occupations across decades: an interactive chart to analyze the occupational evolution of notable individuals.

References

Beytía, P. (2020). The Positioning Matters: Estimating Geographical Bias in the Multilingual Record of Biographies on Wikipedia. Companion Proceedings of the Web Conference 2020, 806–810.

Beytía, P., Agarwal, P., Redi, M., & Singh, V. (2022). Visual Gender Biases in Wikipedia: A Systematic Evaluation across the Ten Most Spoken Languages. AAAI Conference on Web and Social Media (ICWSM).

Beytía, P., & Müller, H.-P. (2022). Towards a Digital Reflexive Sociology: Using Wikipedia’s Biographical Repository as a Reflexive Tool. Poetics, 101732.

Beytía, P., & Schobin, J. (2020). Networked Pantheon: A Relational Database of Globally Famous People. Research Data Journal for the Humanities and Social Sciences, 5, 1–16.

Beytía, P., & Wagner, C. (2022). Visibility Layers: A Framework for Facing the Complexity of the Gender Gap in Wikipedia Content. Internet Policy Review.

Entman, R. M. (1993). Framing: Toward Clarification of a Fractured Paradigm. Journal of Communication, 43(4), 51–58.

Entman, R. M. (2007). Framing bias: Media in the distribution of power. Journal of Communication, 57(1), 163–173.

Hinnosaar, M. (2019). Gender inequality in new media: Evidence from Wikipedia. Journal of Economic Behavior & Organization, 163, 262–276.

Jara-Figueroa, C., Yu, A. Z., & Hidalgo, C. A. (2019). How the medium shapes the message: Printing and the rise of the arts and sciences. PLoS ONE, 14(2), e0205771.

Menini, S., Sprugnoli, R., Moretti, G., Bignotti, E., Tonelli, S., & Lepri, B. (2017). Ramble on: Tracing movements of popular historical figures. Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 77–80.

Redi, M., Gerlach, M., Johnson, I., Morgan, J., & Zia, L. (2021). A Taxonomy of Knowledge Gaps for Wikimedia Projects (Second Draft). ArXiv:2008.12314 [Cs].

Reznik, I., & Shatalov, V. (2016). Hidden revolution of human priorities: An analysis of biographical data from Wikipedia. Journal of Informetrics, 10(1), 124–131.

Samoilenko, A., Lemmerich, F., Weller, K., Zens, M., & Strohmaier, M. (2017). Analysing timelines of national histories across Wikipedia editions: A comparative computational approach. Eleventh International AAAI Conference on Web and Social Media.

Schich, M., Song, C., Ahn, Y.-Y., Mirsky, A., Martino, M., Barabási, A.-L., & Helbing, D. (2014). A network framework of cultural history. Science, 345(6196), 558–562.

Wagner, C., Graells-Garrido, E., Garcia, D., & Menczer, F. (2016). Women through the glass ceiling: Gender asymmetries in Wikipedia. EPJ Data Science.

Wikipedia. (2023). Wikipedia:Purpose. In Wikipedia.

Yu, A. Z., Ronen, S., Hu, K., Lu, T., & Hidalgo, C. A. (2016). Pantheon 1.0, a manually verified dataset of globally famous biographies. Scientific Data, 3, 150075.