WikiCite 2016/Report/Group 4/Notes

Notes and links

See also StrepHit 1.0 Beta Release
See also IEG Grant – StrepHit: Wikidata Statements Validation via References

Goal

  1. Play with the current StrepHit dataset (biographies in English): DONE
  2. Create and fill in a Request for Comments: DONE
  3. Encourage referenced data donations through the primary sources tool: DONE
  4. Follow up on past discussion with ContentMine and Hypothes.is people: DONE

Notes

  • primary sources tool: editing the statement if something's wrong
  • Till with genomic datasets
  • domain-specific use cases == domain-specific curation tools
  • different colors
  • use hypothes.is API to highlight extracted sentences
  • ContentMine
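
A minimal sketch of the hypothes.is idea noted above: build the JSON body of an annotation that anchors to an extracted sentence through a TextQuoteSelector. The function name, the example URL, and the "StrepHit" tag are assumptions made for illustration; actually creating the annotation would mean POSTing this body to https://api.hypothes.is/api/annotations with a personal API token, which is left out here.

```python
import json


def build_highlight_annotation(page_url, sentence, note=""):
    """Return an annotation body that highlights `sentence` on `page_url`.

    The sentence is located on the page by exact text match
    (TextQuoteSelector), so it must appear verbatim in the source page.
    """
    return {
        "uri": page_url,
        "text": note,
        "tags": ["StrepHit"],  # hypothetical tag, pick whatever fits
        "target": [
            {
                "source": page_url,
                "selector": [
                    {"type": "TextQuoteSelector", "exact": sentence}
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder URL and sentence, not taken from a real dataset.
    body = build_highlight_annotation(
        "https://example.org/biography",
        "She was born in Paris.",
        note="Sentence used as a reference for an extracted statement.",
    )
    print(json.dumps(body, indent=2))
```
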

It would be great if you could add a statement of interest about ContentMine's potential data donation via the primary sources tool here (feel free to add a new section, of course): https://meta.wikimedia.org/wiki/Grants_talk:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Timeline

Instructions to upload a dataset to the primary sources tool:

  1. Format your data in the QuickStatements syntax; documentation at http://tools.wmflabs.org/wikidata-todo/quick_statements.php
  2. Ping me for an API access token.
  3. Upload the dataset through the following API endpoint:
https://tools.wmflabs.org/wikidata-primary-sources/import
Documentation at https://github.com/google/primarysources/tree/master/backend#import-statements

As an alternative to steps 2 and 3, you can give the dataset to Hjfocs, who will upload it directly.
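
A rough sketch of step 1 only (the upload itself is not covered; see the backend documentation linked above for the request format). It assumes the tab-separated QuickStatements form in which a reference is appended as a source property written with an "S" prefix instead of "P" and a double-quoted string value. The function names and the example IDs and URL are placeholders.

```python
def format_statement(subject, prop, value, reference_url=None):
    """Return one tab-separated QuickStatements line.

    subject and value are item IDs such as "Q42", prop is a property ID
    such as "P248". If reference_url is given, it is appended as a
    source column pair: S854 followed by the quoted URL.
    """
    fields = [subject, prop, value]
    if reference_url is not None:
        # Source properties use "S" instead of "P"; strings are quoted.
        fields += ["S854", '"{}"'.format(reference_url)]
    return "\t".join(fields)


def format_dataset(statements):
    """Join several (subject, prop, value, reference_url) tuples into a dataset."""
    return "\n".join(format_statement(*s) for s in statements)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    rows = [
        ("Q42", "P248", "Q229883", "https://example.org/article-1"),
        ("Q42", "P248", "Q229883", None),
    ]
    print(format_dataset(rows))
```
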

Data modeling, i.e., how to go from ContentMine extraction results to the QuickStatements dataset.

Each statement is composed of:
A. subject = given the extracted named entity, look up the subject's Wikidata item ID via
A.1. SPARQL
A.2. API endpoint: https://www.wikidata.org/w/api.php?action=help&modules=wbsearchentities
B. property = d:Property:P248 'stated in'
C. value = item ID of the source, e.g., d:Q229883 for PubMed Central
D. reference URL = d:Property:P854
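
A possible sketch of option A.2, using only the Python standard library: build a wbsearchentities request for an extracted entity name and pull the candidate item IDs out of the JSON response. The helper names and the User-Agent string are assumptions; the search returns ranked candidates, so the first hit is not guaranteed to be the right subject and some disambiguation is still needed.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def build_search_url(name, language="en", limit=5):
    """Build a wbsearchentities URL that searches items by label or alias."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def candidate_ids(response):
    """Extract item IDs from a decoded wbsearchentities response."""
    return [hit["id"] for hit in response.get("search", [])]


def lookup_item_ids(name, language="en"):
    """Query the live API and return candidate item IDs (needs network access)."""
    request = urllib.request.Request(
        build_search_url(name, language),
        headers={"User-Agent": "wikicite-notes-example/0.1"},  # placeholder
    )
    with urllib.request.urlopen(request) as resp:
        return candidate_ids(json.load(resp))


if __name__ == "__main__":
    print(build_search_url("Douglas Adams"))
```
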

Side notes

  • FrameNet lexical database for N-ary relation extraction: https://framenet.icsi.berkeley.edu/fndrupal/

  • Adam: references collected from Microdata
    • especially for movies
    • Google Custom Search for specific microformats (cf. Sindice)