ORCID cleanup (with National Library of the Czech Republic authority files)
- your name and/or Wikimedia username
- Jiří Sedláček (User:Frettie)
- your contact e-mail
- jiri.sedlacek wikimedia.cz
- your nearest city and country
- Prague, Czech Republic
Details of team members (optional)
- Jiří Sedláček, User:Frettie (Prague), role: ingestion of data, iterative matching
- Vojtěch Dostál, User:Vojtěch Dostál (Prague), role: guiding the whole process, building on expertise with these datasets
- Benjamín Skála, User:Ben Skála (Brno), role: evaluation of identified matches, providing feedback for data analyst
There are many human items on Wikidata that are automatically created to be linked from publication items – they often only consist of an English label and an ORCID identifier value. At the same time, we previously managed to release a National Library database of authority files under a CC-0 license (approximately 1 million records). Many of them were imported or could be imported, but we need to prevent duplication as much as possible.
Our aim is to retrieve all items that have an ORCID identifier but no Czech National Library ID, and to try to link them to authority records using the first and last name. Normally, this would not be a very good way to match person items to a database, but there is currently no other option, because the ORCID-based items contain very little information that could be used to identify duplicates.
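The name-based matching described above could be sketched roughly as follows. This is only an illustration, not the script used in the project: it assumes that ORCID items carry a plain "First Last" English label, that authority headings look like "Last, First, dates", and that ignoring case and diacritics is acceptable. All function names and the example identifiers are hypothetical.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List, Optional


def name_key(first: str, last: str) -> str:
    """Build a case- and diacritics-insensitive key from a first and last name."""
    def fold(text: str) -> str:
        decomposed = unicodedata.normalize("NFKD", text)
        stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return " ".join(stripped.casefold().split())
    return f"{fold(last)}|{fold(first)}"


def key_from_label(label: str) -> Optional[str]:
    """Key for a 'First Last' label; None unless the label has exactly two parts."""
    parts = label.split()
    if len(parts) != 2:
        return None
    return name_key(parts[0], parts[1])


def key_from_heading(heading: str) -> Optional[str]:
    """Key for a 'Last, First[, dates]' heading (assumed authority-file format)."""
    parts = [p.strip() for p in heading.split(",")]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        return None
    return name_key(parts[1], parts[0])


def candidate_links(orcid_items: Dict[str, str],
                    authority_records: Dict[str, str]) -> Dict[str, List[str]]:
    """Return {item id: [authority ids]} for ORCID items whose name key matches."""
    index = defaultdict(list)
    for auth_id, heading in authority_records.items():
        key = key_from_heading(heading)
        if key:
            index[key].append(auth_id)
    links = {}
    for qid, label in orcid_items.items():
        key = key_from_label(label)
        if key and key in index:
            links[qid] = list(index[key])
    return links


# Made-up example data; the identifiers are placeholders, not real records.
orcid_items = {"Q_EXAMPLE_1": "Jan Novak", "Q_EXAMPLE_2": "Jane Doe"}
authority_records = {
    "auth_example_1": "Novák, Jan, 1970-",
    "auth_example_2": "Novák, Jan",
    "auth_example_3": "Svoboda, Petr",
}
links = candidate_links(orcid_items, authority_records)
```

In this example the label "Jan Novak" matches two different authority records, which shows why name-only matching produces ambiguous candidates that still need manual evaluation.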
The result will be a table of possible links between ORCID items and entries in the Czech National Library. We will iteratively go through them manually and improve our script to minimize false positives and maximize correct identifications. Where a matched Library identifier is already present on another Wikidata item, the two items will be merged; ORCID items whose match has no item yet will be enriched with the National Library identifier, thus putting Czech researchers into a wider bibliographical context. We plan to provide the remaining unpaired suggestions to the community as a matching tool or a wikitable for further volunteer work.
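One way to picture the resulting table is the sketch below, which turns candidate links into rows with a proposed action. It is an assumption-laden illustration rather than the project's actual workflow: the function name, the action labels ("merge", "enrich", "review") and the rule that any ambiguous match goes to manual review are all hypothetical, and every row would still be checked by a human.

```python
from typing import Dict, List, Tuple


def propose_actions(links: Dict[str, List[str]],
                    existing_items: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Turn candidate links into rows of (orcid_item, authority_id, action).

    links: {orcid_item: [authority_id, ...]} produced by name matching.
    existing_items: {authority_id: item} for IDs already present on an item.
    """
    rows = []
    for qid, auth_ids in sorted(links.items()):
        for auth_id in auth_ids:
            if len(auth_ids) > 1:
                action = "review"   # several namesakes: only a human can decide
            elif auth_id in existing_items:
                action = "merge"    # the ID already sits on another item
            else:
                action = "enrich"   # add the ID to the ORCID item
            rows.append((qid, auth_id, action))
    return rows


# Made-up placeholder identifiers, for illustration only.
example_links = {
    "Q_A": ["auth_1"],
    "Q_B": ["auth_2"],
    "Q_C": ["auth_3", "auth_4"],
}
example_existing = {"auth_1": "Q_X"}
table = propose_actions(example_links, example_existing)
```

Rows marked "review" correspond to the suggestions that could later be offered to the community as a wikitable.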
This project differs from our normal activities in the sheer complexity and number of records that need to be connected and paired. It also involves a lot of data cleaning and other considerations which require time.
In the future, more ORCID items will certainly be created in Wikidata. We will be able to simply repeat this process without having to reinvent the wheel.
- Vojtěch Dostál / User:Vojtěch Dostál: I have been involved in several collaborative projects with the National Library of the Czech Republic, mostly as a volunteer project manager. Last year I worked on a project to release a large freely licensed dataset of Czech bibliographical data and to import it into Wikidata. In the last month I also worked on manual de-duplication of person items of Czech researchers, so I have gained good knowledge of the dataset currently present in Wikidata.
- Benjamín Skála / User:Ben Skála: I am a manual editor and have made approximately 200,000 manual edits on Wikidata over the past four years (adding new data, creating new items, merging, adding references to existing data, etc.). My role is to systematically go through the records added by bot by Frettie and Vojtěch Dostál and to correct them manually if needed.
Proposed activity dates
Probably four days: two weekends in spring 2021.
Optional: Community members are encouraged to endorse your proposal and leave a rationale here.
- The proposed project has the potential to significantly enhance the perception of Wikidata as a valuable data source for enriching library catalogs/databases in the Czech Republic and beyond. As the ORCID identifier is widely used in the academic community, I expect that the project results will eventually help in other data reconciliation/matching projects and save time and effort needed for implementing such projects. I appreciate that the project seeks to employ matching algorithms but also puts an emphasis on manual data quality check. To sum up, from my position of an information specialist working for the National Library of the Czech Republic, I endorse the project. Linda.jansova (talk) 12:22, 29 September 2020 (UTC)