Research:History and cultural heritage of the Canary Islands in the Wikimedia projects

Contact

Iván Hernández Cazorla

Q72314524

ORCID: 0000-0002-5436-8535

Wikimedia España

Duration: 2019-05 – 2020-03

Open data
via github.com

Open source
via github.com

Research:Projects

This page documents a completed research project.

History and cultural heritage of the Canary Islands in the Wikimedia projects. Uses and advantages of the scientific knowledge transfer to free knowledge projects (official title, in Spanish: Historia y patrimonio cultural de Canarias en los proyectos Wikimedia. Usos y ventajas de la transferencia de conocimiento científico a proyectos de conocimiento libre). This project mainly analyzes the state of knowledge about the cultural heritage and history of the Canary Islands in the Wikimedia projects, chiefly in the Spanish Wikipedia but also in Wikimedia Commons and Wikidata. In addition, it analyzes the possible uses and advantages of a digital transfer of scientific knowledge to projects whose free licenses allow their content to be reused and disseminated legally and freely.

Objectives

The main objective of this project is to assess the state of knowledge about the cultural heritage and history of the Canary Islands in the Wikimedia projects, especially in the Spanish Wikipedia, Wikimedia Commons and Wikidata. Once this objective is achieved, it will be possible to:

Evaluate the quantity and the quality of the knowledge.
Consider how and when the articles were created and how they have evolved.

When were the most articles created?
Is interest in writing these articles increasing or decreasing?
Have the quantity and the quality of the articles increased since their creation?

Consider the uses and advantages of scientific knowledge transfers.

How could these transfers help improve the current content in the Wikimedia projects?
How could institutions and individuals ease digital and free access to knowledge to improve the current content?

Methods

The core of this project is its technical part. A command-line interface is being developed in Python for two data extraction tasks: collecting the corpus of articles itself, and extracting the data for each page.

There are different ways to address this (e.g., analyzing the official Wikipedia dumps), but the one chosen for this project is to review the pages of a set of categories. These are subcategories of the categories Cultura de Canarias (Culture of the Canary Islands) and Historia de Canarias (History of the Canary Islands).
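As a rough illustration of reviewing the pages of a set of categories, the sketch below walks two root categories and their subcategories breadth-first and lists each page once. It is not the project's code: the in-memory dictionary is a hypothetical stand-in for the live wiki, the subcategory names and page titles are made up for the example, and the depth limit is an assumption.

```python
from collections import deque

# Hypothetical, hand-made category tree standing in for the live wiki.
# Keys are category names; values list their subcategories and member pages.
CATEGORY_TREE = {
    "Historia de Canarias": {
        "subcats": ["Conquista de Canarias"],
        "pages": ["Historia de Canarias", "Guanche"],
    },
    "Cultura de Canarias": {
        "subcats": ["Conquista de Canarias", "Museos de Canarias"],
        "pages": ["Guanche", "Silbo gomero"],
    },
    "Conquista de Canarias": {"subcats": [], "pages": ["Batalla de Acentejo"]},
    "Museos de Canarias": {"subcats": [], "pages": ["Museo Canario"]},
}


def collect_pages(roots, tree, max_depth=2):
    """Breadth-first walk over the roots and their subcategories.

    Returns the candidate page titles in first-seen order, each one once,
    even when a page or subcategory hangs from more than one root.
    """
    seen_cats = set()
    seen_pages = set()
    pages = []
    queue = deque((root, 0) for root in roots)
    while queue:
        cat, depth = queue.popleft()
        if cat in seen_cats or cat not in tree:
            continue
        seen_cats.add(cat)
        for title in tree[cat]["pages"]:
            if title not in seen_pages:
                seen_pages.add(title)
                pages.append(title)
        if depth < max_depth:
            for sub in tree[cat]["subcats"]:
                queue.append((sub, depth + 1))
    return pages


candidates = collect_pages(
    ["Historia de Canarias", "Cultura de Canarias"], CATEGORY_TREE
)
```

Tracking visited categories and pages in sets keeps shared subcategories and doubly categorized pages from being listed twice.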

HAC{k}ProWiki, a tool to perform the technical tasks

Thus, a tool provisionally named HAC{k}ProWiki is being developed during the project to extract a corpus of pages in the content namespaces of the Spanish Wikipedia: the main one and Anexo.^{[notes 1]} The tool is written in Python and uses Pywikibot as a bridge to the Wikipedia API. Its main functions are:

Check the pages in the chosen categories, show each one to the operator and offer the option to add it to the corpus (a CSV file) or discard it. Each page is included with an identifier.
For each page in the corpus (a row), extract the data (see Data corpus) into another corpus (a CSV file).
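The first function, the semi-automatic review, could look roughly like the sketch below. It is only an assumption about the general shape, not the tool's actual code: the function names, the column names id and title, and the example titles are all hypothetical, and the accept/discard decision in the demonstration is arbitrary.

```python
import csv
import io


def curate(candidates, decide, out):
    """Write accepted page titles to a CSV corpus, one row per page.

    `decide` is any callable taking a title and returning True (add) or
    False (discard). Returns the number of rows written.
    """
    writer = csv.writer(out)
    writer.writerow(["id", "title"])  # hypothetical column names
    next_id = 1
    for title in candidates:
        if decide(title):
            writer.writerow([next_id, title])
            next_id += 1
    return next_id - 1


def ask_operator(title):
    """Interactive decision: show the title and read y/n from the terminal."""
    answer = input(f"Add '{title}' to the corpus? [y/n] ")
    return answer.strip().lower() == "y"


# Non-interactive demonstration with a scripted, arbitrary decision.
buffer = io.StringIO()
accepted = curate(
    ["Guanche", "Carnaval de Santa Cruz de Tenerife", "Museo Canario"],
    decide=lambda title: title != "Carnaval de Santa Cruz de Tenerife",
    out=buffer,
)
```

Passing the decision in as a callable keeps the loop testable; in an interactive session `ask_operator` would be passed as `decide` and `out` would be an open file.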

Once the project is finished, the process will be described in detail, along with the benefits and problems of using the tool. In addition, the tool will probably remain under development to make it available to anyone who wants to perform similar tasks.

Data corpus

We want to collect at least the following data:

Date of creation, with the identifier (oldid) of the first version.
Date of last edit with its identifier.
Size of the articles in bytes (and words in plain text without template codes?).
Number of users who have edited the page (anonymous/registered/total).
Number of references.
Sources used in each article.
Availability in other languages (sitelinks).
Number of links to the article.
Number of articles with talk pages.
Number of multimedia resources available in Wikimedia Commons.
Information about the Wikidata item.

The corpus of articles and the data for each article are collected in CSV files to ease subsequent analysis by the team. All of this data will be published under a free license to allow its reuse.
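To make the idea of one CSV row per article concrete, here is a minimal sketch of a record type covering part of the list above. The field names and the example values are hypothetical; the project's real data model is described on the Data corpus subpage and may differ.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ArticleRecord:
    """One row of the per-article CSV; all field names are hypothetical."""
    page_id: int
    title: str
    first_oldid: int
    created: str        # ISO 8601 date of the first revision
    last_oldid: int
    last_edited: str    # ISO 8601 date of the latest revision
    size_bytes: int
    editors_registered: int
    editors_anonymous: int
    references: int
    sitelinks: int
    incoming_links: int
    has_talk_page: bool
    wikidata_item: str

    @property
    def editors_total(self):
        """Total editors, derived rather than stored as its own column."""
        return self.editors_registered + self.editors_anonymous


def write_records(records, out):
    """Serialise records to CSV with a header derived from the dataclass."""
    names = [f.name for f in fields(ArticleRecord)]
    writer = csv.DictWriter(out, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


# Made-up values, only to show the shape of a row.
example = ArticleRecord(
    page_id=1, title="Ejemplo", first_oldid=100, created="2005-01-01",
    last_oldid=200, last_edited="2019-12-31", size_bytes=4096,
    editors_registered=12, editors_anonymous=5, references=8,
    sitelinks=3, incoming_links=20, has_talk_page=True, wikidata_item="Q0",
)
buffer = io.StringIO()
write_records([example], buffer)
```

Deriving the header from the dataclass keeps the column order and the record definition in one place, which helps when columns are added later.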

Timeline

It is a ten-month project organized in two periods (one of four months and two weeks, and another of five months and two weeks):

Development of the bot and curation of the data.

Manual review of the first levels of the categories about the Canary Islands and preparation of the requirements to program the bot (1 month).
Bot programming in Python to make it extract and structure the data automatically. Tests to check that everything works correctly (2 months).
Curation (extracting and structuring) of the data in their respective CSV documents for the subsequent analysis (1 month and 2 weeks).

Analysis, evaluation, possible uses and advantages, and results presentation.

Analysis and evaluation of the data corpus obtained (2 months).
Analysis of its usefulness, as well as its possible uses and advantages (1 month and 2 weeks).
Writing of a scientific document, a journal paper or a conference paper, to introduce and spread the project (motivation, precedents, methodology, results, similar projects, sources and data) in the academic and scientific field (2 months).

This timeline is flexible. The programming part may take longer than specified because of possible issues that are more or less complicated to solve. In general, each part shown above has enough time, and time can be redistributed between them if necessary.

Policy, Ethics and Human Subjects Research

For this project it is very important to clarify the following:

The project will extract only the username of the main editor of each article in the Spanish Wikipedia. In no case will it be used to extract specific data about the users.
The tool, HAC{k}ProWiki, is not able to edit. None of the purposes of this project is to improve the pages or anything else; it will only evaluate, qualitatively and quantitatively, the current state of the mentioned fields in the Spanish Wikipedia.

Results

Results of the phase 1

In the first phase of the project the results are divided into two parts: one related to the source code of the tool developed to curate the corpus of articles and categories semi-automatically, and another for the full corpus itself.

A preliminary picture was obtained with Quarry (see Table 1).

Table 1. Data obtained with Quarry
	Historia de Canarias	Cultura de Canarias
Categories analyzed	51	169
Subcategories (total across those analyzed)	66	269
Articles (total)	607	2182

Once the articles to analyze had been obtained, it was possible to determine the following key points:

There is a great deal of redundancy between the categories; that is, some subcategories and articles are categorized in both. This means that in the data obtained with Quarry some articles and subcategories are counted in both categories.
The categories are not properly organized. Subcategories and articles were found to be inefficiently overcategorized.
The cultural heritage and the history of the Canary Islands are both loosely represented.
Many of the articles returned by the Quarry queries, mainly in the category "Cultura de Canarias", are not part of this study. This category was chosen to extract articles related to cultural heritage, but many of them correspond to items that, under the established criteria, were judged not to qualify as cultural heritage.
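The redundancy described in the first point can be measured with plain set operations. The sketch below uses made-up titles, not the project's data, and the function name is hypothetical; it only shows why adding the two per-category totals overstates the number of distinct articles.

```python
def overlap_report(history, culture):
    """Compare two sets of titles and report how much they share."""
    shared = history & culture
    distinct = history | culture
    return {
        "history": len(history),
        "culture": len(culture),
        "shared": len(shared),
        "naive_sum": len(history) + len(culture),
        "distinct": len(distinct),
    }


# Illustrative titles only.
history_pages = {"Guanche", "Batalla de Acentejo", "Museo Canario"}
culture_pages = {"Guanche", "Museo Canario", "Silbo gomero", "Timple"}
report = overlap_report(history_pages, culture_pages)
```

In this toy case the naive sum is 7 while only 5 distinct titles exist, because 2 titles sit in both categories.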

We then started to analyze specific data for each article in order to obtain the data specified.

Results of the phase 2

The most remarkable result of this phase was the curated data for the articles collected in the first phase. This is the first corpus of data about the state of knowledge about the cultural heritage and history of the Canary Islands. It comprises 1149 items. This allows us not only to know what state they are in, but also to keep track of that same corpus and extend it as necessary.

Table 2. Total amount of data extracted
Columns per article	31
Number of articles	1149
Total	35619

The total amount of data reaches 35619 individual facts, counting one CSV cell as one fact (see Table 2). This calculation counts neither the identifier cells nor the CSV header.
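The figure in Table 2 is the product of the two counts, 31 × 1149 = 35619. The sketch below shows one way such a count could be taken from a CSV file, skipping the header row and the identifier column; the column name id and the sample data are assumptions made for the example.

```python
import csv
import io


def count_facts(csv_text, id_column="id"):
    """Count data cells, skipping the header row and the identifier column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    data_columns = [name for name in reader.fieldnames if name != id_column]
    rows = sum(1 for _ in reader)
    return len(data_columns) * rows


# Reported figures: 31 data columns for each of 1149 articles.
reported_total = 31 * 1149

# Tiny made-up sample: 2 data columns and 2 rows.
sample = "id,title,size_bytes\n1,A,10\n2,B,20\n"
sample_total = count_facts(sample)
```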

Output

Publications

The methodology used in this research project, as well as the results, the analysis performed and the data corpus, will be published soon.^{[notes 2]}

HAC{k}ProWiki

The developed tool makes it easy to perform a massive curation of the data to analyze. However, due to time constraints it has not been possible to make it an easy-to-use tool, so it is specific to this project. One objective for the coming months is to develop a tool with an interface that allows the user to customize and curate the necessary data.

Subpages

These subpages expand on specific phases so as not to overload the main research page.

Data corpus. Dedicated to all the data planned for extraction during the project and to the data model of the CSV files.

Bibliography

This bibliography is the one used to present the project statement to the Fundación Universitaria de Las Palmas. Items in this bibliography may or may not be used in the scientific content produced from this project:

Hill, Benjamin Mako; Dailey, Dharma; Guy, Richard T.; Lewis, Ben; Matsuzaki, Mika; Morgan, Jonathan T. (2017). "Democratizing Data Science: The Community Data Science Workshops and Classes" (PDF). In Sorin Adam Matei, Nicolas Jullien, Sean P. Goggins (eds.). Big Data Factories. Computational Social Sciences. Cham: Springer International Publishing. pp. 115–135. ISBN 978-3-319-59186-5. Retrieved 2019-03-01.
Miller, Julia (2014). "Building academic literacy and research skills by contributing to Wikipedia: A case study at an Australian university". Journal of Academic Language and Learning 8 (2): A72–A86. ISSN 1835-5196. Retrieved 2017-11-19.
Miquel-Ribé, Marc; Laniado, David (2019). "Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions". arXiv preprint. Retrieved 2019-03-01.
Nielsen, Finn Årup (2007). "Scientific citations in Wikipedia". First Monday 12 (8). ISSN 1396-0466. doi:10.5210/fm.v12i8.1997. Retrieved 2019-02-02.
Saorín, Tomás (2013). "Iniciativas GLAM-Wiki: Wikipedia como oportunidad para instituciones culturales". Anuario ThinkEPI 7: 78–85. Retrieved 2019-03-01.
Singer, Philipp; Lemmerich, Florian; West, Robert; Zia, Leila; Wulczyn, Ellery; Strohmaier, Markus; Leskovec, Jure (2017). "Why We Read Wikipedia". Proceedings of the 26th International Conference on World Wide Web - WWW '17. The 26th International Conference on World Wide Web. Perth, Australia: ACM Press. pp. 1591–1600. ISBN 978-1-4503-4913-0. doi:10.1145/3038912.3052716. Retrieved 2019-03-01.
Soler-Adillon, Joan; Pavlovic, Dragana; Freixa, Pere (2018). "Wikipedia en la Universidad: cambios en la percepción de valor con la creación de contenidos". Comunicar 26 (54): 39–48. ISSN 1988-3293. doi:10.3916/C54-2018-04. Retrieved 2019-03-01.
Suber, Peter (2012). Open access. MIT Press essential knowledge series. Cambridge, Mass: MIT Press. ISBN 978-0-262-51763-8.
Tramullas, Jesús (2015). "Wikipedia como objeto de investigación". Anuario ThinkEPI 9: 223–226. ISSN 2564-8837. doi:10.3145/thinkepi.2015.50. Retrieved 2019-03-01.
Tramullas, Jesús (2018). "Wikipedia, educación e información científica". Anuario ThinkEPI 12: 328. ISSN 2564-8837. doi:10.3145/thinkepi.2018.50. Retrieved 2019-03-01.
Vivaldi Palatresi, Jorge; Rodríguez Hontoria, Horacio (2011). "Extracting terminology from Wikipedia". ISSN 1135-5948. Retrieved 2019-07-16.

Notes

[notes 1] This namespace, Anexo, is the equivalent of the pages dedicated to lists, such as those in the English Wikipedia that begin with List of... or in the Portuguese Wikipedia with Lista de.... A dedicated namespace for lists is peculiar to some wikis, such as the Spanish Wikipedia and the Basque Wikipedia.
[notes 2] The results section will be expanded when everything has been published.
