Grants:Project/Harej/Librarybase: an online reference library/Midpoint

This project is funded by a Project Grant

Report accepted

This midpoint report for a Project Grant approved in FY 2017-18 has been reviewed and accepted by the Wikimedia Foundation.

To read the approved grant submission describing the plan for this project, please visit Grants:Project/Harej/Librarybase: an online reference library.
You may still review or add to the discussion about this report on its talk page.
You are welcome to email projectgrants@wikimedia.org at any time if you have questions or concerns about this report.

Welcome to this project's midpoint report! This report shares progress and learnings from the Individual Engagement Grantee's first 3 months.

Summary

In a few short sentences or bullet points, give the main highlights of what happened with your project so far.

I figured out the relationship between Wikidata and the Librarybase wiki. We want as much on Wikidata as possible, while the Librarybase wiki is still available for more experimental data modeling.
I did a lot of work to improve data quality on Wikidata items about journal articles.
I optimized scripts to run faster and keep up with Wikidata as it grows.
I have been consulting with community members and potential collaborators, including WikiProject Source Metadata and researcher John Bohannon.

Methods and activities

How have you set up your project, and what work has been completed so far?

Describe how you've set up your experiment or pilot, sharing your key focuses so far and including links to any background research or past learning that has guided your decisions. List and describe the activities you've undertaken as part of your project to this point.

Research and planning

While the Librarybase wiki, which runs the same software as Wikidata, is a resource for storing records about works, my preference is to put as much as possible on Wikidata so that it can be better integrated with that project's data. I consulted with WikiProject Source Metadata on what would be "in scope" for Wikidata, proposing to create entries on open-access review articles. While there is no fixed consensus on this at the moment, there is general agreement that review articles, as well as journal articles cited on Wikipedia, are acceptable for inclusion on Wikidata. (Other documents were not ruled out; they just did not come up in discussion. Indeed, Wikidata editors have contributed many small batches of Wikidata items about journal articles and other such publications, and ProteinBoxBot creates items for journal articles cited on Wikidata as part of a larger project.)

With that in consideration, the Librarybase project will take a "Wikidata-first" approach, where Wikidata is a preferred destination for well-modeled source metadata, while the Librarybase wiki will be used for more "experimental" modeling projects, such as with web pages. Thanks to work by the Wikimedia Foundation, the Wikidata Query Service now allows queries to include data from multiple sources. This will allow us to make a query to both Wikidata and Librarybase simultaneously from a common endpoint.
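A federated query of this kind could look roughly like the sketch below, which joins Wikidata items to Librarybase items on a shared DOI. It relies on the standard SPARQL 1.1 SERVICE clause and on Wikidata's DOI property (P356). The Librarybase endpoint URL and the Librarybase DOI property URI are hypothetical placeholders, not actual addresses.

```python
# Hypothetical placeholders: the real Librarybase endpoint and property
# URIs are not given in this report.
LIBRARYBASE_ENDPOINT = "https://librarybase.example/sparql"
LIBRARYBASE_DOI_PROPERTY = "https://librarybase.example/prop/direct/P1"


def build_federated_query(limit: int = 10) -> str:
    """Build a SPARQL query joining Wikidata and Librarybase items by DOI.

    The query is meant to be sent to the Wikidata Query Service, where the
    wdt: prefix is predefined; P356 is Wikidata's DOI property.
    """
    return f"""
SELECT ?wikidataItem ?doi ?librarybaseItem WHERE {{
  ?wikidataItem wdt:P356 ?doi .
  SERVICE <{LIBRARYBASE_ENDPOINT}> {{
    ?librarybaseItem <{LIBRARYBASE_DOI_PROPERTY}> ?doi .
  }}
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    print(build_federated_query(limit=5))
```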

The discussion also raised an important consideration: if the long-term goal is to have citations served from a central database, it is important that this does not affect the formatting of articles. As such, I will make sure that the citation database does not lose any data in the process of storing citations. This way, we can have structured data for citations without disrupting existing Wikipedia articles.

Considering this, Librarybase will proceed with two separate databases: one to store citation events (i.e. a citation of a source on Wikipedia) and another to represent works (the documents being cited). For works, Wikidata and the Librarybase wiki will serve as a community-editable data store, while the specialized citation event database will connect to entries in Wikidata/Librarybase while also faithfully representing the citations as they appear on Wikipedia.
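As a rough illustration of this split, the sketch below models a work and a citation event as two separate record types. The class and field names are assumptions made for illustration only; the actual schema has not been decided. The key point is that the citation event keeps the original wikitext verbatim, so nothing about the article's formatting is lost.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Work:
    """A cited document, stored on Wikidata or the Librarybase wiki."""
    entity_id: str                                    # ID of the entity describing the work
    identifiers: Dict[str, str] = field(default_factory=dict)  # e.g. {"doi": "..."}


@dataclass
class CitationEvent:
    """One citation of a source as it appears in a Wikipedia article."""
    wiki: str                        # e.g. "enwiki"
    page_title: str
    raw_wikitext: str                # stored verbatim so formatting is preserved
    work_id: Optional[str] = None    # link to a Work once it has been resolved


def render(event: CitationEvent) -> str:
    """Return the citation exactly as it appeared on Wikipedia."""
    return event.raw_wikitext
```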

In addition:

I have been looking into applying an existing bibliographic ontology to Wikidata to allow for interoperability between Wikidata and other databases. I began work on integrating the SPAR ontologies with Wikidata.
I have been working with researcher John Bohannon to develop a service to recommend sources for articles.
The Internet Archive may be a possible partner for recommending sources as well, based on their massive holdings.
As for Wikipedia Library integration, the best approach for now is to identify recommended documents through other means, rather than systematically retrieving lists of sources from the partners, and then ping the respective partner databases to see whether each document exists in their holdings.

Quality control

Quality control was a significant consideration, since large-scale creation of Wikidata items can also result in large-scale introduction of error. My work on quality control focused largely on cross-referencing works between different databases, including PubMed, PubMed Central, and Crossref. The main issue is that these databases don't always "talk" to each other, meaning that two Wikidata entries could be created on the same article and we would be none the wiser. Through a number of strategies, including querying PubMed's identifier conversion API and doing lookups on Crossref based on pieces of metadata like article title and volume/issue numbers, we have been able to associate identifiers with other identifiers, root out thousands of duplicates, and prevent the creation of other duplicates.
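One simplified way to surface candidate duplicates, once identifiers have been associated with items, is to group items by shared identifier values. The sketch below is not the actual Librarybase script; the function name and the input shape (item ID plus a mapping of identifier type to value) are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Item = Tuple[str, Dict[str, str]]  # (item ID, {identifier type: value})


def find_duplicate_groups(items: Iterable[Item]) -> List[List[str]]:
    """Return groups of item IDs that share at least one identifier value."""
    by_identifier = defaultdict(set)
    for item_id, identifiers in items:
        for id_type, value in identifiers.items():
            # Compare case-insensitively so differently capitalized DOIs match.
            by_identifier[(id_type, value.strip().upper())].add(item_id)
    # The same pair may match on several identifiers; report each group once.
    groups = {tuple(sorted(ids)) for ids in by_identifier.values() if len(ids) > 1}
    return sorted(list(group) for group in groups)
```

Groups found this way are only candidates for merging; an item pair that shares a PMID but carries conflicting DOIs, for example, would still need a human look.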

One other cause of duplication involves a quirk of the DOI specification. DOIs (digital object identifiers) are not case-sensitive, while Wikidata treats identifier values as case-sensitive. This means that Wikidata would consider "10.1000/abc" and "10.1000/ABC" to be two separate things, even though in reality they should resolve to the same document. This resulted in the creation of many duplicate entries owing to different capitalization styles. However, working with WikiProject Source Metadata, I forged a consensus around normalizing DOIs to use all capital letters, getting around this issue.
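The normalization itself is small. A minimal sketch, with a hypothetical function name, assuming the input is a bare DOI string rather than a URL:

```python
def normalize_doi(doi: str) -> str:
    """Return the DOI in the all-uppercase form agreed on for Wikidata."""
    return doi.strip().upper()


# Both capitalization styles now map to the same stored value.
assert normalize_doi("10.1000/abc") == normalize_doi("10.1000/ABC")
```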

I also worked on scripts for filling in missing fields such as article title, date of publication, and "published in [journal]" statements. These scripts have had a significant impact on Wikidata constraint violation reports, as reported below.
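Deciding which items need filling in can be sketched as a check for statements that are absent. The example below assumes a simplified `claims` mapping from property ID to a list of values, loosely modeled on the claims section of a Wikidata entity; the function name is hypothetical. The property IDs are Wikidata's title (P1476), publication date (P577), and published in (P1433).

```python
from typing import Dict, List

# Wikidata property IDs for the fields the scripts fill in.
REQUIRED_PROPERTIES = {
    "P1476": "title",
    "P577": "publication date",
    "P1433": "published in",
}


def missing_fields(claims: Dict[str, list]) -> List[str]:
    """List the required fields for which an item has no statement."""
    return [
        label
        for property_id, label in REQUIRED_PROPERTIES.items()
        if not claims.get(property_id)
    ]
```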

As part of an early exploration of quality control issues, I came up with an idea for a Wikidata Content Schema – a structured document describing what is required of certain classes of Wikidata items. There was not much appetite for it so I have not pursued it further.

Software development

I have been optimizing scripts I had written earlier so that they run faster and scale better as Wikidata grows. These scripts carry out the aforementioned data quality work, identify relationships between documents (the "Citation Graph"), and create large numbers of new Wikidata entries from a list of identifiers. Sebotic's Wikidata Integrator has been key for these optimizations, since it allows the scripts to save edits directly to Wikidata, as opposed to generating data to feed into another tool like Quick Statements. Caching API lookups in Redis has also greatly reduced the number of repeated queries I have had to make, much to the relief of PubMed and Crossref, I am sure.
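The caching pattern can be sketched as follows. This is an illustration rather than the actual script: `cached_lookup` and `InMemoryCache` are hypothetical names, and the in-memory class stands in for a Redis client so the example runs without a server. It mirrors only the `get` and `set(..., ex=...)` calls that a redis-py client also provides.

```python
import json
from typing import Any, Callable


def cached_lookup(cache: Any, key: str, fetch: Callable[[], dict],
                  ttl_seconds: int = 86400) -> dict:
    """Return the cached result for key, calling fetch() only on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    result = fetch()  # e.g. a request to PubMed or Crossref
    cache.set(key, json.dumps(result), ex=ttl_seconds)
    return result


class InMemoryCache:
    """Stand-in for a Redis client; expiry is accepted but not enforced."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str):
        return self._data.get(key)

    def set(self, key: str, value: str, ex=None) -> None:
        self._data[key] = value
```

With a real Redis client passed as `cache`, repeated lookups for the same identifier would be answered locally until the entry expires.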

Due to personnel shortages, work on extracting citations from Wikipedia has been put on hold. In the meantime, using Aaron Halfaker's identifier extraction script, I have been experimenting with some programmatic Wikidata item creation. The script's report is generated infrequently, though, and has many issues, including a lack of association between identifier and work (i.e. the same work can be referred to by multiple identifiers). A proposed solution is a centralized database of identifiers, designed to return results very quickly: you give it one identifier and get the others in return.
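The proposed lookup behavior can be illustrated with a small in-memory index. This is only a sketch of the idea under assumed names (`IdentifierIndex`, `add`, `lookup`); the real database has not been built, and a production version would need persistent storage and handling for conflicting values.

```python
from typing import Dict, Tuple

Identifier = Tuple[str, str]  # (scheme, value), e.g. ("doi", "10.1000/ABC")


class IdentifierIndex:
    """Give one identifier, get every identifier known for the same work."""

    def __init__(self) -> None:
        self._records: Dict[Identifier, Dict[str, str]] = {}

    def add(self, identifiers: Dict[str, str]) -> None:
        """Register identifiers that are known to refer to the same work."""
        merged = dict(identifiers)
        # Pull in anything already known about any of these identifiers.
        for key in identifiers.items():
            merged.update(self._records.get(key, {}))
        # Point every identifier of the work at the same merged record.
        for key in merged.items():
            self._records[key] = merged

    def lookup(self, scheme: str, value: str) -> Dict[str, str]:
        """Return all identifiers for the work, or an empty dict if unknown."""
        return dict(self._records.get((scheme, value), {}))
```

Each lookup is a single dictionary access, which is the property the proposal is after: a fast answer to "what else is this work known as?"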

Related software projects:

Midpoint outcomes

What are the results of your project or any experiments you’ve worked on so far?

Please discuss anything you have created or changed (organized, built, grown, etc) as a result of your project to date.

Improved data quality, as measured in number of constraint violations. Wikidata uses constraint violations to detect missing statements from entries, or statements that are illogical.
- Among items with DOIs:
  - November 1, 2016: 866 violations across 249,946 entries: 63 format errors, 579 single value violations, 224 unique value violations, 0 qualifier violations
  - May 9, 2017: 225 violations across 605,198 entries: 21 format errors, 178 single value violations, 25 unique value violations, 1 qualifier violation
- Among items with PubMed IDs:
  - November 1, 2016: 23,558 violations across 249,951 entries: 101 type violations, 0 format violations, 2 single value violations, 381 unique value violations, 6,635 title violations, 13,279 published-in violations, 3,160 publication date violations, 0 qualifier violations
  - May 9, 2017: 6,007 violations across 598,919 entries: 105 type violations, 0 format violations, 65 single value violations, 5 unique value violations, 850 title violations, 4,784 published-in violations, 198 publication date violations, 0 qualifier violations
- Among items with PubMed Central IDs:
  - November 1, 2016: 26,958 violations across 187,102 entries: 18 type violations, 0 format violations, 0 single value violations, 106 unique value violations, 6,332 title violations, 14,725 published-in violations, 5,777 publication date violations, 0 qualifier violations
  - May 9, 2017: 11,364 violations across 339,643 entries: 20 type violations, 0 format violations, 10 single value violations, 0 unique value violations, 2,813 title violations, 5,753 published-in violations, 2,768 publication date violations, 0 qualifier violations
  - (Many of these violations are from "stub" entries that are nothing more than "instance of scientific article" and a PMCID – this highlights the need to fill out incomplete entries)
Richer Wikidata entries as a result of scripts incorporating information from other databases.
- Scripts to associate PMCIDs, PMIDs, and DOIs with each other, as well as "discovering" the DOI of a work based on a few pieces of metadata. This was key for identifying Wikidata entries created on journal articles but without identifiers. Methodically adding identifiers to Wikidata entries means preventing the creation of duplicate entries and integrating Wikidata with other datasets.
- A growing citation graph, now at 3.3 million connections. Identifying citation relationships between documents allows us to identify influential works that could be cited on Wikipedia. The associated I4OC initiative has been successful in liberating citation data on Crossref, meaning we now have an additional source of data.
- These scripts have been optimized to run quickly, allowing them to keep up with a growing Wikidata.
Progress on creating a scalable system to document sources on Wikidata as they are used on Wikipedia. More work needs to be done but we are getting closer to the goal of documenting Wikipedia source usage in real time.
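Because the number of entries grew substantially between the two snapshots, the constraint violation counts listed above are easier to compare as rates. The short calculation below uses only the totals reported in this section; the function name is illustrative. For items with DOIs, for instance, it works out to roughly 3.5 violations per 1,000 entries in November 2016 versus roughly 0.4 in May 2017.

```python
def violations_per_thousand(violations: int, entries: int) -> float:
    """Express a violation count as a rate per 1,000 entries."""
    return 1000 * violations / entries


# (violations, entries) totals as reported in this section.
REPORTED = {
    "DOI": {"2016-11-01": (866, 249_946), "2017-05-09": (225, 605_198)},
    "PubMed ID": {"2016-11-01": (23_558, 249_951), "2017-05-09": (6_007, 598_919)},
    "PubMed Central ID": {"2016-11-01": (26_958, 187_102), "2017-05-09": (11_364, 339_643)},
}

if __name__ == "__main__":
    for identifier, snapshots in REPORTED.items():
        for date, (violations, entries) in snapshots.items():
            rate = violations_per_thousand(violations, entries)
            print(f"{identifier}, {date}: {rate:.2f} violations per 1,000 entries")
```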

Finances

Please take some time to update the table in your project finances page. Check that you’ve listed all approved and actual expenditures as instructed. If there are differences between the planned and actual use of funds, please use the column provided there to explain them.

Then, answer the following question here: Have you spent your funds according to plan so far? Please briefly describe any major changes to budget or expenditures that you anticipate for the second half of your project.

We are currently under budget as a result of being understaffed. We are looking into adjusting our timeline and revising the budget accordingly.

Learning

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.

What are the challenges

What challenges or obstacles have you encountered? What will you do differently going forward? Please list these as short bullet points.

Staffing changes – having to find new people in the middle of the project. I am not sure what the best solution to this is, other than trying to find a replacement as quickly as possible, which I have thus far not succeeded in doing.

What is working well

What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.

Grants:Learning patterns/Wikidata mass imports

Next steps and opportunities

What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points. If you're considering applying for a 6-month renewal of this IEG at the end of your project, please also mention this here.

Further development and systematization of the Librarybase scripts, with the goal of having them run automatically
Setting up the identifier cross-reference database
Beginning to pilot Wikipedia citation extraction and citation database development
Cleanup and update of the Librarybase wiki
Once the infrastructural work is done, conducting user research and putting together a requirements document for the reference recommendation platform, ideally with a prototype by the end of the grant period

Grantee reflection

We’d love to hear any thoughts you have on how the experience of being an IEGrantee has been so far. What is one thing that surprised you, or that you particularly enjoyed from the past 3 months?

I've been trying to do this while also wrapping up WikiProject X, and this, combined with my other obligations, has been rather difficult. However, WikiProject X is pretty much done for now, so I hope I can do more to make Librarybase a reality. And I am looking forward to seeing what comes out of WikiCite 2017!