Research:History and cultural heritage of the Canary Islands in the Wikimedia projects

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


History and cultural heritage of the Canary Islands in the Wikimedia projects. Uses and advantages of scientific knowledge transfer to free knowledge projects (official title, in Spanish: Historia y patrimonio cultural de Canarias en los proyectos Wikimedia. Usos y ventajas de la transferencia de conocimiento científico a proyectos de conocimiento libre). This project consists mainly of an analysis of the state of knowledge about the cultural heritage and history of the Canary Islands in the Wikimedia projects, mainly the Spanish Wikipedia, but also Wikimedia Commons and Wikidata. In addition, it analyzes the possible uses and advantages of a digital transfer of scientific knowledge to projects with free licenses, which allow their content to be reused and disseminated legally and freely.


Objectives

The main objective of this project is to assess the state of knowledge about the cultural heritage and history of the Canary Islands in the Wikimedia projects, especially the Spanish Wikipedia, Wikimedia Commons and Wikidata. Once this objective is achieved, it will be possible to:

  • Evaluate the quantity and the quality of the knowledge.
  • Consider how and when the articles were created and how they have evolved:
      • When were most articles created?
      • Is interest in writing these articles increasing or decreasing?
      • Have the quantity and quality of the articles increased since their creation?
  • Consider the uses and advantages of scientific knowledge transfers:
      • How could these transfers help improve the current content of the Wikimedia projects?
      • How could institutions and individuals ease free digital access to knowledge to improve the current content?

Methods

The essential parts of this project are those dedicated to the technical aspects. A command-line interface is being developed in Python for the data extraction, with two parts: one dedicated to collecting the corpus of articles itself, and another dedicated to extracting the data referring to each page.

There are different ways to address this (e.g., analyzing the official Wikipedia dumps), but the one chosen for this project is to review the pages in a set of categories: the subcategories of Cultura de Canarias (Culture of the Canary Islands) and Historia de Canarias (History of the Canary Islands).
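The following is a minimal sketch of this category-based review, assuming a working Pywikibot installation configured for the Spanish Wikipedia; the one-level recursion is an illustrative choice, since the real depth of the category tree is decided manually:

    # Sketch: list candidate pages under the two root categories.
    import pywikibot
    from pywikibot import pagegenerators

    site = pywikibot.Site('es', 'wikipedia')

    for root in ('Categoría:Cultura de Canarias', 'Categoría:Historia de Canarias'):
        cat = pywikibot.Category(site, root)
        # recurse=1 descends one level of subcategories (illustrative depth).
        for page in pagegenerators.CategorizedPageGenerator(cat, recurse=1):
            # Keep only content namespaces: main (0) and Anexo (104 on eswiki).
            if page.namespace() in (0, 104):
                print(page.title())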

HAC{k}ProWiki, a tool to perform the technical tasks

Thus, a tool, temporarily named HAC{k}ProWiki, is being developed during the project to extract a corpus of pages from the content namespaces of the Spanish Wikipedia: the main one and Anexo.[notes 1] The tool is written in Python, using Pywikibot as a bridge to the Wikipedia API. Its main functions are:

  • Check the pages in the chosen categories, show each one to the operator, and offer the option of adding it to the corpus (a CSV file) or discarding it; each page is stored with an identifier (a sketch of this loop follows the list).
  • For each page in the corpus (a row), extract its data (see Data corpus) into another CSV file.
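A hedged sketch of the interactive corpus-building loop follows; it assumes Pywikibot is configured for eswiki, and the file name, column layout and prompt are illustrative rather than the tool's actual interface:

    # Sketch: review pages one by one and record accepted ones with an id.
    import csv
    import pywikibot
    from pywikibot import pagegenerators

    site = pywikibot.Site('es', 'wikipedia')
    cat = pywikibot.Category(site, 'Categoría:Historia de Canarias')

    with open('corpus.csv', 'w', newline='', encoding='utf-8') as fh:
        writer = csv.writer(fh)
        writer.writerow(['id', 'title'])
        identifier = 0
        for page in pagegenerators.CategorizedPageGenerator(cat):
            answer = input(f'Add "{page.title()}" to the corpus? [y/N] ')
            if answer.strip().lower() == 'y':
                identifier += 1
                writer.writerow([identifier, page.title()])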

Once the project is finished, the process will be described in detail, along with the benefits and problems of using the tool. In addition, the tool will probably remain under development, to make it available to anyone who wants to perform similar tasks.

Data corpus

We want to collect at least the following data:

  • Date of creation, with the identifier (oldid) of the first version.
  • Date of the last edit, with its identifier.
  • Size of the article in bytes (and possibly in words of plain text, without template code).
  • Number of users who have edited the page (anonymous/registered/total).
  • Number of references.
  • Sources used in each article.
  • Availability in other languages (sitelinks).
  • Number of links to the article.
  • Number of articles with talk pages.
  • Number of multimedia resources available on Wikimedia Commons.
  • Information about the Wikidata item.

The corpus of articles and the data for each article are collected in CSV files, to ease the team's subsequent analysis. All these data will be published under a free license to allow their reuse. A sketch of a per-page extractor follows.
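As an illustration, the following sketch extracts a few of the fields above for a single page. It assumes Pywikibot is configured for eswiki; the field selection and output layout are this sketch's own, not the tool's actual data model:

    # Sketch: collect a few of the planned fields for one article.
    import pywikibot
    from pywikibot.exceptions import NoPageError

    site = pywikibot.Site('es', 'wikipedia')

    def extract_row(title):
        page = pywikibot.Page(site, title)
        first = page.oldest_revision    # first version: .revid (oldid), .timestamp
        last = page.latest_revision     # most recent version
        editors = page.contributors()   # Counter mapping username -> edit count
        try:
            item = page.data_item()     # linked Wikidata item, if any
            qid = item.id
            sitelinks = len(item.get()['sitelinks'])
        except NoPageError:             # no Wikidata item linked to this page
            qid, sitelinks = '', 0
        return [
            first.revid, first.timestamp.isoformat(),   # creation date and oldid
            last.revid, last.timestamp.isoformat(),     # last edit and its id
            len(page.text.encode('utf-8')),             # size in bytes (wikitext)
            len(editors),                               # distinct editors (total)
            sitelinks, qid,                             # availability in other languages
            page.toggleTalkPage().exists(),             # does a talk page exist?
        ]

    print(extract_row('Historia de Canarias'))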

Timeline

It is a ten-month project organized in two periods (one of four months and two weeks, and another of five months and two weeks):

  1. Development of the bot and curation of the data.
      1. Manual revision of the first levels of the categories about the Canary Islands and preparation of the requirements to program the bot (1 month).
      2. Bot programming in Python so that it extracts and structures the data automatically, with tests to check that everything works correctly (2 months).
      3. Curation (extraction and structuring) of the data into their respective CSV documents for the subsequent analysis (1 month and 2 weeks).
  2. Analysis, evaluation, possible uses and advantages, and presentation of results.
      1. Analysis and evaluation of the data corpus obtained (2 months).
      2. Analysis of its usefulness, as well as its possible uses and advantages (1 month and 2 weeks).
      3. Writing of a scientific document, a journal or conference paper, to introduce and disseminate the project (motivation, precedents, methodology, results, similar projects, sources and data) in the academic and scientific field (2 months).

This timeline is flexible. The programming part may take more time than specified due to possible issues of varying complexity. In general, each part shown above has enough time allocated, with some to spare that can be redistributed between parts if necessary.

Policy, Ethics and Human Subjects Research

For this project it is very important to clarify the following:

  • It is likely that one of the data points to collect will be the main editor of each page, in order to build a ranking of the most active editors in this field. However, it is not yet decided whether the usernames should be shown in the corpus or whether we should replace them with an alias (e.g., U1, U2, U3); a minimal sketch of such aliasing follows this list. To clarify this situation, the community will be asked whether this might disrupt their privacy and comfort with the project.
  • In addition to the previous point, the project is not going to extract usernames or any other user-related data from Wikimedia Commons, Wikidata or any other Wikimedia project.
  • The tool, HAC{k}ProWiki, is not enabled to edit. None of the purposes of this project is to improve the pages or anything else; it will just evaluate, qualitatively and quantitatively, the current state of the mentioned fields in the Spanish Wikipedia.
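If the community prefers aliases, a possible pseudonymisation step could map usernames to stable labels before publication. This is a minimal illustration, not yet part of the tool:

    # Sketch: map each distinct username to an alias (U1, U2, ...) in
    # first-seen order, so published data carry no real usernames.
    def pseudonymise(usernames):
        aliases = {}
        for name in usernames:
            if name not in aliases:
                aliases[name] = f'U{len(aliases) + 1}'
        return [aliases[name] for name in usernames]

    print(pseudonymise(['Alice', 'Bob', 'Alice']))  # -> ['U1', 'U2', 'U1']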

Results

Once the study is complete, the results and their implications will be described here.

Subpages

These subpages expand the information about a specific phase, so as not to overload the main research page.

  • Data corpus. Dedicated to all the data planned to be extracted during the execution of the project and the data model of the CSV files.

Bibliography

This bibliography is the one used to present the project statement to the Fundación Universitaria de Las Palmas. Items in this bibliography may or may not be used in the scientific content produced from this project:

Notes

  1. This namespace, Anexo, is the equivalent of the pages dedicated to lists, such as those beginning with List of... in the English Wikipedia or Lista de... in the Portuguese Wikipedia. A dedicated namespace for lists is peculiar to some wikis, such as the aforementioned one and the Basque Wikipedia.

References