University of Virginia/Machine learning for MeSH terms

This page documents a planned research project.
Information may be incomplete and change before the project starts.


Problem

(summary from proposal) As highlighted in the National Library of Medicine (NLM)’s Strategic Plan 2017-2027, curation at scale is an important research direction to accelerate the availability of and access to secure, complete data sets and computational models that can serve as the basis of transformative biomedical discoveries.

Currently, human experts read the full text of each biomedical article and annotate it with suitable concepts from the biomedical thesaurus they are using. This manual labeling is highly accurate but expensive. Annotating one biomedical article in MEDLINE is estimated to cost around $9.4 on average [48], and more than 813,500 citations were added to MEDLINE in 2017. Beyond the monetary cost, labeling a newly published article also takes human experts considerable time. An automatic system capable of labeling biomedical articles with concepts from predefined biomedical thesauri would therefore be very helpful.

To support NLM’s goal of making digital data assets more findable, accessible, interoperable, and reusable (FAIR), we propose a novel end-to-end MeSH indexing model for biomedical text, based on deep learning and an attention mechanism, to automatically index large-scale biomedical literature. With this model, we propose to design a unified, self-contained classifier (termed MeSHCurator) that automatically extracts general and topic-related biomedical information from biomedical articles through a combination of self-supervised and supervised learning. In addition, to make these assets more accessible, interoperable, and reusable, we propose to integrate MeSH indexing with Wikidata, which is among the most widely reused datasets developed by the Wikimedia Foundation. To achieve the widest possible dissemination and public impact, we will incorporate into MeSHCurator selected features from Wikimedia platforms (e.g. Wikidata identifiers), which are FAIR and open to exporting query results [6]. This will make PubMed content more accessible to the global community of one billion annual users who view and reuse Wikimedia content.

Objective

  1. Consider all PubMed-indexed papers
  2. Use technology to produce MeSH terms to describe these papers
  3. Publish these terms to the Wikidata catalog of PubMed papers
  4. Develop functionality which supports exploration of Wikidata's copy and adaptation of the PubMed catalog
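
As a rough illustration of step 3, the sketch below turns model predictions into tab-separated QuickStatements (version 1 syntax) commands that could be reviewed before upload. It is only a sketch under stated assumptions: the `Prediction` record, the `to_quickstatements` helper, the 0.9 confidence cutoff, and the choice of the Wikidata "main subject" property (P921) to hold the label are all hypothetical and are not taken from the proposal. The item IDs in the usage example are placeholders.

```python
from typing import Iterable, List, NamedTuple

# Wikidata property "main subject". Using it to store MeSH-derived labels
# is an assumption of this sketch, not a decision stated by the project.
P_MAIN_SUBJECT = "P921"


class Prediction(NamedTuple):
    """Hypothetical record for one predicted label on one paper."""
    paper_qid: str  # Wikidata item for the paper
    topic_qid: str  # Wikidata item matched to the predicted MeSH term
    score: float    # model confidence in [0, 1]


def to_quickstatements(predictions: Iterable[Prediction],
                       min_score: float = 0.9) -> List[str]:
    """Return one tab-separated 'item, property, value' line per
    sufficiently confident, non-duplicate prediction."""
    lines: List[str] = []
    seen = set()
    for p in predictions:
        if p.score < min_score:
            continue  # drop low-confidence labels
        key = (p.paper_qid, p.topic_qid)
        if key in seen:
            continue  # avoid emitting the same statement twice
        seen.add(key)
        lines.append("\t".join((p.paper_qid, P_MAIN_SUBJECT, p.topic_qid)))
    return lines


if __name__ == "__main__":
    # Placeholder item IDs, used only to show the output shape.
    demo = [
        Prediction("Q4115189", "Q13406268", 0.95),
        Prediction("Q4115189", "Q13406268", 0.97),  # duplicate pair
        Prediction("Q4115189", "Q15397819", 0.40),  # below the cutoff
    ]
    for line in to_quickstatements(demo):
        print(line)
```

Filtering on a confidence cutoff before publishing is one possible way to keep machine-generated statements conservative; the right threshold would have to come from the project's own evaluation.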

Data

Deliverables

  1. A novel collection of MeSH labels applied to PubMed-indexed papers
  2. Publication of the same in Wikidata
  3. Development of the Wikidata environment to increase access to and use of this data
    1. Queries for browsing academic literature in Wikidata
    2. Multilingual translations of MeSH terms
    3. Integration with other Wikidata scholarly cataloging efforts, including the WikiCite project, author disambiguation, and association of papers and research with authors' institutions
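
A browsing query of the kind listed above might look like the following sketch, which builds a SPARQL string for the Wikidata Query Service. It assumes papers are linked to topics through "main subject" (P921), that topic items carry a MeSH descriptor ID (P486), and that papers carry a PubMed ID (P698). These properties exist on Wikidata, but whether the project will model its output this way is an assumption. The helper name `build_mesh_paper_query` and the ID format check are hypothetical.

```python
import re

# Wikidata property IDs used in the query.
P_MAIN_SUBJECT = "P921"     # main subject
P_PUBMED_ID = "P698"        # PubMed ID
P_MESH_DESCRIPTOR = "P486"  # MeSH descriptor ID

# Assumed format of a MeSH descriptor ID: "D" followed by 6 or 9 digits.
MESH_ID_PATTERN = re.compile(r"D\d{6}(\d{3})?")


def build_mesh_paper_query(mesh_descriptor_id: str, limit: int = 20) -> str:
    """Build a SPARQL query listing papers whose main subject is the
    Wikidata item carrying the given MeSH descriptor ID."""
    if not MESH_ID_PATTERN.fullmatch(mesh_descriptor_id):
        raise ValueError(f"unexpected MeSH descriptor ID: {mesh_descriptor_id!r}")
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return "\n".join([
        "SELECT ?paper ?paperLabel ?pmid WHERE {",
        f'  ?topic wdt:{P_MESH_DESCRIPTOR} "{mesh_descriptor_id}" .',
        f"  ?paper wdt:{P_MAIN_SUBJECT} ?topic ;",
        f"         wdt:{P_PUBMED_ID} ?pmid .",
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }',
        "}",
        f"LIMIT {limit}",
    ])


if __name__ == "__main__":
    # The descriptor ID here is only an example value.
    print(build_mesh_paper_query("D008288", limit=5))
```

The resulting string can be pasted into the Wikidata Query Service interface. Changing the language code in the label service line is one way such queries could surface multilingual labels.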

Research Team