Research:Wikimedia versus traditional biographical encyclopedias

Created: 19:52, 3 June 2024 (UTC)
Collaborators
Duration: 2024-07 – 2025-06
Grant ID: G-RS-2402-15215

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Wikipedia has become a primary source of biographical information thanks to its extensive coverage and accessibility. However, Wikipedia and Wikidata depend heavily on the output of research institutions, in part because of the No Original Research policy. This project aims to analyze how traditional biographical dictionaries are produced, examine their relationship to Wikipedia and Wikidata, identify current problems, and propose solutions that improve collaboration between Wikimedia and the creators of traditional biographical dictionaries.

Full grant proposal

Methods

The quantitative analysis will primarily identify language, cultural, gender, socio-economic, geographic, and other representation gaps in the content of Czech Wikipedia, Wikidata, and academic biographical dictionaries.[1]
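As an illustration only, one way to quantify a representation gap is to compare the share of each category (for example, gender) across two sources. The sketch below is not the project's method; the function names, the `gender` field, and the sample records are all hypothetical and made up for the example.

```python
from collections import Counter


def category_shares(records, field):
    """Return the share of each value of `field` among `records` (a list of dicts)."""
    counts = Counter(r.get(field, "unknown") for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


def share_gap(source_a, source_b, field):
    """Difference in shares (a minus b) for every value seen in either source."""
    a = category_shares(source_a, field)
    b = category_shares(source_b, field)
    return {v: a.get(v, 0.0) - b.get(v, 0.0) for v in sorted(set(a) | set(b))}


# Made-up records for illustration; these are NOT project data.
wikipedia_sample = [
    {"gender": "female"},
    {"gender": "male"},
    {"gender": "male"},
    {"gender": "male"},
]
dictionary_sample = [
    {"gender": "female"},
    {"gender": "male"},
]

gap = share_gap(wikipedia_sample, dictionary_sample, "gender")
print(gap)
```

A positive value means the category is relatively more common in the first source than in the second. The same comparison could in principle be run over any of the dimensions listed above, provided the field is recorded in both sources.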

The qualitative (content) analysis will compare entities in Wikidata with entries in traditional biographical dictionaries. Using machine learning tools (handwritten text recognition, HTR, and natural language processing, NLP), basic data (date and place of birth and death, studies, occupation, works) will be extracted from the relevant entries on Wikipedia and Wikidata and from the corresponding entries in the traditional biographical dictionaries. Based on a comparison of the resulting datasets, solutions will be proposed to improve the quality of Wikipedia and Wikidata by filling the identified gaps.
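The comparison step described above could be sketched roughly as follows, assuming the extracted data for one person is already available as two flat dictionaries. The field names, the status labels, and the sample person are hypothetical and fictional; the project's actual data model and tooling may differ.

```python
# Hypothetical field names, chosen only for this example.
FIELDS = (
    "date_of_birth",
    "place_of_birth",
    "date_of_death",
    "place_of_death",
    "occupation",
)


def compare_records(wikidata, dictionary, fields=FIELDS):
    """Classify each field as a match, a conflict, or missing on one or both sides."""
    report = {}
    for field in fields:
        w = wikidata.get(field)
        d = dictionary.get(field)
        if w is None and d is None:
            report[field] = "missing_in_both"
        elif w is None:
            report[field] = "missing_in_wikidata"
        elif d is None:
            report[field] = "missing_in_dictionary"
        elif w == d:
            report[field] = "match"
        else:
            report[field] = "conflict"
    return report


# Fictional person with made-up values; NOT taken from any real source.
wikidata_record = {
    "date_of_birth": "1850-03-01",
    "date_of_death": "1910-07-15",
    "occupation": "teacher",
}
dictionary_record = {
    "date_of_birth": "1850-03-01",
    "place_of_birth": "Brno",
    "date_of_death": "1911-07-15",
    "occupation": "teacher",
}

report = compare_records(wikidata_record, dictionary_record)
for field, status in report.items():
    print(f"{field}: {status}")
```

In this sketch, fields marked `missing_in_wikidata` are candidates for gap filling, while `conflict` entries would need manual review against the sources before any edit is made.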

Results

  1. International conference – a platform for presenting the interim results of the analyses, formulating the ideas and opinions of the institutions producing biographical dictionaries and Wikimedia representatives, strengthening contacts, and finding common solutions to the current situation.
  2. Scientific publication – analysis and proposed solutions for Wikimedia and biographical dictionary developers.
  3. Machine learning models – an HTR model[2] and a model for extracting data from biographical records, intended for Wikidata contributors.[3]

Resources

Will be provided after the project launch (July 2024).

References

  1. For gap definitions, see Miriam Redi, Martin Gerlach, Isaac Johnson, Jonathan Morgan, Leila Zia, A Taxonomy of Knowledge Gaps for Wikimedia Projects (Second Draft), 2021, pp. 21–24, arXiv:2008.12314.
  2. Preliminary segmentation tool: https://doi.org/10.5281/zenodo.10783346.
  3. Preliminary version of the tool for transforming biographical data into Wikidata format: https://doi.org/10.57967/hf/1898.