WikiCite 2016/Report/Group 5/Notes
Notes and links
Goal
Wikidata will serve as a centralized, highly structured repository capable of representing the highly networked nature of the scholarly sources that support the knowledge archived across all Wikimedia projects. This signals an unprecedented opportunity for not only scientists and scholars but also society at large to explore the complex landscape of human knowledge. Yet it is not clear what such an exploration would look like. What kinds of questions can be asked of such a system? With the generous support of Crossref and the Sloan and Moore Foundations, the WikiCite 2016 Workshop established a working group not only to envision concrete use cases for scholarly source-related questions in Wikidata but also to determine whether the technical foundations required to express those questions as intelligent, efficient, and systematic queries are in place. Where these technical foundations are lacking but needed, the working group tasked itself with developing proposals for overcoming such limitations.
This group focused on discussing and prioritizing use cases for Wikidata queries involving source metadata. The working assumption is that all the required data is already available. We also worked on obtaining a small, openly licensed bibliographic and citation graph dataset to build a proof of concept of the querying and visualization potential of having this data stored in Wikidata and exposed via SPARQL.
Notes
- See Proposal: Retrieving Wikidata statements by source
- Aim
- discuss and prioritize the most important types of source-related queries that WDQS should support
- determine if these queries can be effectively expressed in SPARQL and executed via WDQS or if they require a different indexing / data modeling strategy
Key properties
Properties expressing a citation relation
- Stated in (d:property:P248)
- Main subject (d:property:P921)
- Published in (d:property:P1433)
- Imported from (d:property:P143)
- Cites (d:property:P2860).
The 'Cites' property was suggested, supported and created during the WikiCite 2016 meeting. The quick and bold creation promptly became a topic of discussion on Wikidata and a Wikidata user even suggested it for deletion. Nevertheless, meeting participants rapidly utilized the property to mark up a few scientific papers, so small citation networks could be visualized. This was particularly the case for scientific papers about the Zika virus and fever.
Other relevant properties
- PubMed ID (d:property:P698)
- subclass of (d:property:P279)
- author (d:property:P50)
- short author name (d:property:P2093)
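The 'Stated in' property (P248) is what links a statement's reference to a source item, so it is the natural entry point for retrieving statements by source. The following is a minimal sketch, not a tested WDQS recipe: it only builds the query text and relies on the prefixes that WDQS predefines (`wd:`, `pr:`, `prov:`, `wikibase:`). The function name and the `limit` parameter are illustrative.

```python
import re

# Wikidata item IDs look like "Q" followed by digits.
_QID = re.compile(r"Q[1-9][0-9]*")


def statements_citing(source_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query for statements whose reference is 'stated in' source_qid."""
    if not _QID.fullmatch(source_qid):
        raise ValueError(f"not a Wikidata item ID: {source_qid!r}")
    return "\n".join([
        "SELECT ?item ?property ?statement WHERE {",
        # A reference node that names the source via 'stated in' (P248).
        f"  ?ref pr:P248 wd:{source_qid} .",
        # The statement node that carries this reference.
        "  ?statement prov:wasDerivedFrom ?ref .",
        # The item the statement belongs to, and the property it uses.
        "  ?item ?claim ?statement .",
        "  ?property wikibase:claim ?claim .",
        "}",
        f"LIMIT {int(limit)}",
    ])


if __name__ == "__main__":
    print(statements_citing("Q191020"))
```

The resulting text can be pasted into the WDQS query editor. Sources referenced through other properties (for example a bare reference URL) would not be matched by this pattern.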
See also
Examples
- list all Wikidata statements citing a New York Times article
- e.g. d:Q191020
- list the most popular scholarly journals used as citations of statements for any item that is a subclass of economics
- retrieve all statements citing the works of Joseph Stiglitz (d:Q18430)
- retrieve all statements citing journal articles
- by physicists from Oxford University
- that have a PubMed Central ID
- list all statements citing a specific journal article that was retracted
- list all statements citing (WD) a source that cites (non-WD) a specific journal article (or one that was retracted).
- this is outside the current scope of any Wikidata-related project, since it requires storing scholarly citations between papers
- all Zika-related journal articles (WD) that were published in the last n weeks (see Wikidata WikiProject Source Metadata: Items about Zika virus or fever)
- coauthors of X
- requires storing bibliographic metadata for all publications by X
- coauthors of X in Wikipedia
- is there an interest for coauthors limited to sources cited in Wikipedia?
- other examples of queries on citations restricted to Wikipedia would be more useful
- X's h-index
- requires storing bibliographic metadata for all publications by X and all their citations
- Can citation links be typed?
- Citations restricted by their target
- Note: you cannot add qualifiers to sourcing statements, e.g. stated in (with a specific citation intention)
- How do we think about veracity on WD?
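For the h-index example in the list above, the query side only needs to return one citation count per publication by X (for instance, the number of incoming 'Cites' links per paper); the index itself is then a small computation. A minimal sketch, assuming the counts have already been retrieved as a list of integers (the function name is illustrative):

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the top `rank` papers all have >= rank citations
        else:
            break      # counts only decrease from here on
    return h


if __name__ == "__main__":
    # Made-up counts for five papers, for illustration only.
    print(h_index([10, 8, 5, 4, 3]))
```

The result is only as good as the coverage of the citation graph: any paper or citation missing from Wikidata lowers the computed value.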
Use cases
Reuse source MD
- Example: look up a particular publication via a combination of free-form keywords, e.g. author, journal name, words in the title ('choosing experiments evans sociology')
- Is this something WDQS would be able to return? Would a vanilla search API be more appropriate?
- Publication lists
- Example: all publications by Finn Årup Nielsen sorted by publication date
- requires storing biblio metadata for the entire publication record of a given author
- could potentially be implemented via a script periodically syncing up an author entry on Wikidata with the corresponding ORCID record
- could extend to bibliographies/ reading lists of all types
- knowledge wells (return the developed scholarship from an arbitrary 'community' [e.g. individual, lab, department, division, university, company]).
- custom curriculums
- Example: all publications by members of a given lab
- ORCID supports affiliations as free-form text, Wikidata has the benefit of supporting affiliations via linked data
- Example: all publications supported by grants from a specific funder
- Overlaps potentially with Crossref data (funderID)
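The publication-list use case above could be expressed roughly as follows. This is a sketch under assumptions: it uses author (P50) from the property list and publication date (P577), which these notes do not mention, and it only works for papers whose authors are linked as items rather than stored as name strings. The function name is illustrative; the item ID in the usage line is the one given for Didier Musso in the example queries section.

```python
def publication_list_query(author_qid: str, limit: int = 200) -> str:
    """Build a SPARQL query listing an author's works, oldest first."""
    return "\n".join([
        "SELECT ?work ?workLabel ?date WHERE {",
        # Works whose author (P50) is the given item...
        f"  ?work wdt:P50 wd:{author_qid} ;",
        # ...and that carry a publication date (P577, an assumption here).
        "        wdt:P577 ?date .",
        # WDQS label service, to get a readable title for each work.
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }',
        "}",
        "ORDER BY ?date",
        f"LIMIT {int(limit)}",
    ])


if __name__ == "__main__":
    print(publication_list_query("Q24244119"))
```

Works without a publication date are silently dropped by this pattern; wrapping the date triple in OPTIONAL would keep them at the cost of unsorted entries.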
Sanity Checks
This is mostly targeted at data producers / source owners
- Example 1:
- bot scraping data about proteins and storing sources on Wikidata
- used to reference text, but created errors referencing synonyms, e.g. Ebola River (Q934455) instead of Ebolavirus (Q5331908)
- Example 2:
- graph representations surfacing type/class errors, e.g. US states sharing borders used to return items that are not an instance of a state link
- Example 3:
Federated Wikibase Queries
- run queries across multiple data providers
- analyze data quality by comparing results from separate providers
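Comparing results from separate providers can start very simply once both result sets are keyed by a shared persistent identifier. A minimal sketch, assuming each provider's results have been reduced to a dictionary from identifier (e.g. PMID) to a value such as the title; the function name, the report keys, and the sample data are all made up for illustration:

```python
def compare_providers(a: dict, b: dict) -> dict:
    """Report identifiers missing from one side and identifiers whose values differ."""
    shared = a.keys() & b.keys()
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "conflicting": sorted(k for k in shared if a[k] != b[k]),
    }


if __name__ == "__main__":
    # Placeholder identifiers and titles, not real records.
    provider_a = {"id-1": "Title one", "id-2": "Title two", "id-3": "Title three"}
    provider_b = {"id-2": "Title two", "id-3": "Title 3", "id-4": "Title four"}
    print(compare_providers(provider_a, provider_b))
```

In practice values would need normalizing (case, whitespace, punctuation) before comparison, otherwise trivial differences show up as conflicts.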
Generating a test case
We decided to identify a corpus of references to explore the feasibility of importing them and using them as sources for existing Wikidata items. Requirements for this dataset are the following:
- size: the corpus should have a fairly small number of nodes (articles)
- relevance: the corpus should fill some obvious gaps, such as serving to directly source statements in Wikidata
- PID-ready: the corpus should have clean metadata derivable from persistent identifiers (DOIs or PMIDs)
Obtaining a dataset
- Bibliographic records
- Zika dataset
- Number of nodes:
- PubMed search with "Zika" returns 883 articles https://www.ncbi.nlm.nih.gov/pubmed?term=zika
- List of PubMed IDs resulting from https://www.ncbi.nlm.nih.gov/pubmed/?term=zika+virus[Mesh+terms]+OR+zika+fever[Mesh+terms]
- https://gist.github.com/konrad/341d1b8af1fd602f0f881bcc53c540ab
- Ebola Dataset
- (all Pubmed records with "Ebola" in the title or abstract) -- Available Now here: https://www.dropbox.com/sh/gh5ckftwhvt7pao/AAD6-tXO_Kz-QbphFhUmUCG0a?dl=0
- Number of nodes: 1,600 records.
- ebolaDocs.csv is a list of the records with this delimited schema: id pmid issn year vol issue journal journalAbbrev journalCountry journalNlmID articleTitle (on inspection, this file seems to contain some noisy records; that should not matter because we only need about 50 records).
- ebolaAuthors is a list of the authors with this delimited schema: id pmid rank LastName FirstName initials
- ebolaWD is a list of the entities currently in Wikidata (some of which are missing PMIDs), generated with this query: SPARQL
- Wikidata scientific articles that contain ebola in title
- APS dataset
- Zika dataset
- Citation graph
- APS (DOI prefix: 10.1103, i.e. papers like http://dx.doi.org/10.1103/PhysRevFluids.1.013903 )
- PubMed: full pubmed - Waiting
- PubMed: Ebola Dataset - Waiting
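The two Ebola files described above share a `pmid` column, so author lists can be attached to records with a simple join. The sketch below is hedged on two points the notes do not state: it assumes the files are tab-delimited and have no header row. Field names are taken from the schemas listed above; the function names and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict

DOC_FIELDS = ["id", "pmid", "issn", "year", "vol", "issue", "journal",
              "journalAbbrev", "journalCountry", "journalNlmID", "articleTitle"]
AUTHOR_FIELDS = ["id", "pmid", "rank", "LastName", "FirstName", "initials"]


def read_rows(handle, fields, delimiter="\t"):
    """Read a headerless delimited file into a list of dicts (delimiter is an assumption)."""
    return list(csv.DictReader(handle, fieldnames=fields, delimiter=delimiter))


def authors_by_pmid(author_rows):
    """Group author names by PMID, ordered by their rank within each paper."""
    grouped = defaultdict(list)
    for row in sorted(author_rows, key=lambda r: (r["pmid"], int(r["rank"]))):
        grouped[row["pmid"]].append(f'{row["LastName"]} {row["initials"]}')
    return dict(grouped)


if __name__ == "__main__":
    # Placeholder rows, not taken from the real dataset.
    sample = io.StringIO("1\t111\t2\tDoe\tJane\tJ\n2\t111\t1\tRoe\tRichard\tR\n")
    print(authors_by_pmid(read_rows(sample, AUTHOR_FIELDS)))
```

When reading the real files, open them with `newline=""` as the `csv` module recommends, and adjust `delimiter` if the files turn out to use commas.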
Mapping records to existing WD items
- Zika dataset
- Wikidata WikiProject Source Metadata: Items about Zika virus or fever
- identify items that
- are an instance of (d:property:P31) scientific article (d:Q13442814)
- have Zika virus (d:Q202864) or Zika fever (d:Q8071861) as Main subject (P921)
- PubMed search: https://www.ncbi.nlm.nih.gov/pubmed?term=zika%20virus[Mesh%20terms]%20OR%20zika%20fever[Mesh%20terms]%20&sourceid=mozilla-search
- are orphaned, i.e. are not currently used as a source in any Wikidata statement Query: SPARQL
- alternatively, identify all items that are orphaned Query: SPARQL
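The selection criteria above translate fairly directly into a query. The sketch below is not the query linked from this page; it is one plausible formulation that uses only the item and property IDs given in the list. It treats an article as "orphaned" when no reference points to it via 'Stated in' (P248), which is an assumption about what counts as being used as a source. The function name and flag are illustrative.

```python
# Zika virus and Zika fever, as given in the list above.
ZIKA_TOPICS = ("Q202864", "Q8071861")


def zika_articles_query(orphaned_only: bool = True) -> str:
    """Build a SPARQL query for scientific articles whose main subject is Zika."""
    topics = " ".join(f"wd:{qid}" for qid in ZIKA_TOPICS)
    lines = [
        "SELECT DISTINCT ?article WHERE {",
        f"  VALUES ?topic {{ {topics} }}",
        # instance of (P31) scientific article (Q13442814)
        "  ?article wdt:P31 wd:Q13442814 ;",
        # main subject (P921) is one of the Zika topics
        "           wdt:P921 ?topic .",
    ]
    if orphaned_only:
        # Keep only articles that no reference names via 'stated in' (P248).
        lines.append("  FILTER NOT EXISTS { ?ref pr:P248 ?article . }")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(zika_articles_query())
```

Dropping the `VALUES` and `wdt:P921` lines would give the broader variant that lists all orphaned articles, though that query may be too expensive to run without further restriction.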
Tools for importing data and curating it
- source MD: https://tools.wmflabs.org/sourcemd/?
- takes a DOI or PMID or PMCID as input and generates an item, using the data model specified by source MD
- documentation: https://www.wikidata.org/wiki/Wikidata:WikiProject_Source_MetaData/Source,_M.D./Tests#Methodology
- quick statements tools: https://tools.wmflabs.org/wikidata-todo/quick_statements.php
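Once article items exist, citation links between them can be prepared in bulk for the quick statements tool. The sketch below assumes the tool's tab-separated command format (item, property, value on one line) and generates one 'Cites' (P2860) command per cited item. The function name is illustrative, and the item IDs in the usage line are two Zika paper items mentioned elsewhere in these notes, used purely as placeholders; no claim is made that one actually cites the other.

```python
def cites_commands(citing_qid: str, cited_qids) -> str:
    """Return one tab-separated 'citing P2860 cited' line per distinct cited item."""
    seen = set()
    lines = []
    for cited in cited_qids:
        # Skip self-citations and repeated targets.
        if cited == citing_qid or cited in seen:
            continue
        seen.add(cited)
        lines.append(f"{citing_qid}\tP2860\t{cited}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder pair of item IDs, for illustration only.
    print(cites_commands("Q23906890", ["Q23308149"]))
```

The output is meant to be reviewed by hand before being pasted into the tool, since every line becomes a live edit.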
Curation
Ask the GeneWiki community to help cross-reference this corpus with existing Wikidata items
Running sample queries and visualizations
- Zika dataset
- existing visualizations
- timeline
- listeria-generated list of references
- graph visualizations
- queries
Reference material
- Example queries
- What is a "statement"?
Proposal
- Import an entire corpus of bibliographic metadata and citation graph for a given field
- Show all kinds of queries / visualizations that can be obtained via WDQS
- Source: Pubmed? Mendeley? American Physical Review?
- Mendeley contacted 25 May
- American Physical Review contacted 25 May
Example queries
See also
- Wikidata items that are instances of (P31) scientific article (Q13442814) and have a PMID (P698) or PMCID (P932): SPARQL
- Wikidata items that are instances of scientific article (Q13442814) but do not have a PMID (P698) or PMCID (P932): SPARQL
- Wikidata statements that have scientific papers as references, specifically Wikidata items with statements involving a PMID (P698) or PMCID (P932): SPARQL
- Wikidata statements that have scientific papers as references, specifically Wikidata items that are instances of scientific article (Q13442814) but do not have a PMID (P698) or PMCID (P932): SPARQL
- Most common Zika author strings: SPARQL
- Didier Musso is the most frequent amongst them https://www.wikidata.org/wiki/Q24244119
- Wikidata scientific articles that contain "zika" in the title SPARQL
- Example citation network for Zika research papers: https://angryloki.github.io/wikidata-graph-builder/?property=P2860&item=Q23906890&iterations=5&mode=undirected
- Another example citation network for Zika research papers https://angryloki.github.io/wikidata-graph-builder/?property=P2860&item=Q23308149&mode=both
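The citation networks linked above are built by starting from one item and following 'Cites' (P2860) links for a limited number of steps. The sketch below is not the graph builder's code; it is a plain breadth-first traversal that loosely mirrors the `iterations` parameter in those URLs, run over an edge list that is assumed to have been fetched already as (citing, cited) pairs. The function name, the `undirected` flag, and the sample edges are illustrative.

```python
from collections import deque


def citation_neighbourhood(edges, start, iterations, undirected=True):
    """Return {item: distance} for items within `iterations` citation steps of `start`."""
    adjacency = {}
    for citing, cited in edges:
        adjacency.setdefault(citing, set()).add(cited)
        if undirected:
            # Also allow walking from a cited paper back to the citing one.
            adjacency.setdefault(cited, set()).add(citing)

    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == iterations:
            continue  # do not expand beyond the requested number of steps
        for neighbour in adjacency.get(node, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return depth


if __name__ == "__main__":
    # Made-up edges: A cites B, B cites C, D cites A.
    sample_edges = [("A", "B"), ("B", "C"), ("D", "A")]
    print(citation_neighbourhood(sample_edges, "A", iterations=1))
```

With `undirected=False` the traversal only follows outgoing citations, which answers "what does this paper build on" rather than "what is connected to it".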
Results
- An example of how the set of articles can be used in Wikidata: d:Q202864. This is the item for Zika virus; we added sources to several of its statements that previously had none.
- A property 'cites' (d:property:P2860) was created to model citation events between documents.
- Before and after the meeting, Finn Årup Nielsen created Wikidata items for all papers associated with data in the OpenfMRI neuroimaging database (d:Q23891141).
For Andra
- Zika virus @ BioProject from the National Center for Biotechnology Information (NCBI)