This project will create a software tool for Wikipedia that builds a dataset, in a relational database format (MySQL), of all citations across Wikipedia. The software will iterate through a dump of Wikipedia pages from the dumps site, examine the text of each article, parse out the references, and save them to a database. The database will retain the first instance of a citation under its unique identifier, prioritized by what is available in the reference (DOI, ISBN, URL, etc.). The goal is to avoid duplicates of citations that share the same identifier: when a citation with an already-known identifier is found later, it will be flagged in the database with a link to the existing original citation and to the article it appears on, but no further information will be committed.
Having a central database will allow for researchers and people who wish to improve Wikipedia to easily examine the way citations are being used (in terms of credibility and health) across the whole of Wikipedia, perform maintenance on citations, easily connect references to Wikidata, and many other applications.
The desire for a central database of references was mentioned on this page as early as 2007. A similar project, Understanding the context of citations in Wikipedia, has been carried out, but the difference is that this project attempts to create a tool for building a central database, whereas the project by Kim et al. examined the quality of articles and of citations, resulting in a dataset of all citations.
I will be writing software based on the existing open-source project Print Wikipedia. I was the lead developer of the Print Wikipedia project from 2013 to 2016 and have intimate knowledge of the code used. Progress can be monitored on the GitHub page over the course of the project. I will keep to a strict schedule that will involve:
- Determine which aspects of citations are relevant to the project.
  - Unique identifiers and details of the source, for example, but not the retrieval date.
- Design the structure of the reference database.
  - An article table, a reference table, and a mapping table relating the two, with a column flagging duplicate references.
  - A caching database (Redis) to run in tandem with the relational database (MySQL) for faster lookups while the relational database is being built.
- Examine the existing Print Wikipedia code and decide what is necessary and what is not. Strip the latter away, testing along the way to confirm the code still runs.
- Working with a smaller dataset for testing purposes, add code for examining citations specifically.
- Develop the code through trial-and-error testing, carefully examining all sorts of citation styles.
- Once that is complete, query the database to explore the results, verify that everything is functioning correctly, and make appropriate changes.
- Run the code against a full dump of Wikipedia.
- Go through the database again to explore the results and make changes if things are not functioning properly. If everything is fine, analyze the results and draw conclusions about the data; an easy first metric would be the most used source across Wikipedia.
- Write a blog post describing the results.
- Publish the blog post to relevant sites that would be interested (Reddit, Hacker News, etc.).
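The three-table design in the schedule above can be sketched as follows. This is a hypothetical schema, not the final one: the table and column names are assumptions, and SQLite is used here only as a self-contained stand-in for MySQL.

```python
import sqlite3

# Stand-in for MySQL: the article/reference/mapping layout sketched
# above, expressed in SQLite so the example runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (
    id    INTEGER PRIMARY KEY,
    title TEXT UNIQUE NOT NULL
);
CREATE TABLE reference (
    id         INTEGER PRIMARY KEY,
    id_type    TEXT NOT NULL,        -- 'doi', 'isbn', 'url', ...
    identifier TEXT NOT NULL,
    raw_text   TEXT,                 -- full citation, first instance only
    UNIQUE (id_type, identifier)
);
CREATE TABLE article_reference (
    article_id   INTEGER REFERENCES article(id),
    reference_id INTEGER REFERENCES reference(id),
    is_duplicate INTEGER NOT NULL    -- 1 if this identifier was seen before
);
""")

def record_citation(article_id, id_type, identifier, raw_text):
    """Store the first instance of a citation; flag later ones as duplicates."""
    row = conn.execute(
        "SELECT id FROM reference WHERE id_type = ? AND identifier = ?",
        (id_type, identifier)).fetchone()
    if row:
        ref_id, dup = row[0], 1   # already known: flag it, commit nothing new
    else:
        cur = conn.execute(
            "INSERT INTO reference (id_type, identifier, raw_text) "
            "VALUES (?, ?, ?)", (id_type, identifier, raw_text))
        ref_id, dup = cur.lastrowid, 0
    conn.execute(
        "INSERT INTO article_reference (article_id, reference_id, is_duplicate) "
        "VALUES (?, ?, ?)", (article_id, ref_id, dup))
```

With this layout, the "most used source" metric from the schedule becomes a simple `GROUP BY reference_id` count over `article_reference` joined to `reference`.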
I would like to encourage members of the Wikipedia community to engage on this project's talk page with concerns and suggestions about the idea of a central database for references. I will keep this project open source on GitHub, where other users will be able to see it, make improvements, and submit those improvements for inclusion in the main code base, so the project will continue to evolve. I will also submit the finished project, along with findings I found interesting about both the process and the result, to various tech blogs and communities such as Hacker News and the Wikipedia subreddit.
I believe these to be fertile environments for collaboration and for creating buzz around grassroots projects like this one, and I feel their members will be able to both understand and engage with the project positively.
At the end of the project I will have created software that can scan through a Wikipedia pages data dump and organize all cited sources into a central database. Both the code to perform this, with instructions, and the dataset produced by the first run will be publicly available.
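Scanning a pages dump amounts to streaming one very large XML file. A hedged sketch of that streaming loop, using only the standard library: the inline sample below is a simplified stand-in, since real dumps are many gigabytes and declare a MediaWiki export XML namespace that these bare tag names would need to carry.

```python
import io
import xml.etree.ElementTree as ET

# Minimal, namespace-free stand-in for a pages dump.
sample_dump = io.BytesIO(b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision><text>Body.&lt;ref&gt;{{cite web |url=http://example.org}}&lt;/ref&gt;</text></revision>
  </page>
</mediawiki>""")

def iter_pages(source):
    """Yield (title, wikitext) pairs while streaming the dump."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            text = elem.findtext("revision/text") or ""
            yield title, text
            elem.clear()  # discard the finished subtree so memory stays bounded

for title, text in iter_pages(sample_dump):
    print(title, "<ref>" in text)  # Example True
```

The `elem.clear()` call is the important design choice: without it, a tree parser would accumulate every page in memory, which is untenable for a full dump.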
I am the sole person working on this project, but I will likely ask other Wikipedians and technologists for advice along the way. I will update them on the results of my project, linking to the GitHub repository or to my findings when I publish them.
There will be one participant in this project. No articles will be created and none will be improved immediately by this project; however, the tool created as a result has the potential to strengthen the citations and footnotes of every article on Wikipedia.
The project will be a success if there is a central database of Wikipedia's references and citations, and if others use it for academic research and for making improvements to citations on Wikipedia.
Resources I have:
- Intimate knowledge of the existing code base of Print Wikipedia
- Strong working knowledge of the Wikipedia dump and MediaWiki data structures
- A working computer
Resources I will need:
- More RAM
- 16GB of RAM for Lenovo T530 = $150
- 92 hours at $20/hr over 11 weeks