Grants:IEG/StrepHit: Wikidata Statements Validation via References/Timeline


Timeline for StrepHit: Wikidata Statements Validation via References

Timeline | Date
Development Corpus | April 11 2016
Candidate Relations Set | April 11 2016
StrepHit Pipeline Beta | June 11 2016
Production Corpus | July 11 2016
Web Sources Knowledge Base | July 11 2016


Overview

Monthly updates

Each update will cover a 1-month time span, starting from the 11th day. For instance, January 2016 means January 11th to February 11th 2016.

January 2016

Dissemination activities

  • Jan 15: Kick-off seminar at FBK, Trento, Italy
  • Jan 20: Talk at the event Web 3.0, il potenziale del web semantico e dei dati strutturati, Lugano, Switzerland

Sources identification

We identified 3 candidate domains that may serve as good use cases for the project:

  • Biographies
  • Companies
  • Biomedical literature

Domain | Reasons
Biographies | plenty of existing data; broad coverage; potentially easy to find valuable primary sources; perfect fit for the current prototype
Companies | relatively biased domain; ad-prone content; the company edits the page on the company itself; low-quality data
Biomedical | great primary source, i.e., PubMed; proof of usage for an Open Access corpus; complex implementation

We gathered feedback from different communities (Wikimedians following our seminars, GLAM), which prefer the biographical domain by far. Hence, we selected it and harvested the list of primary sources.

Biographies

The mix'n'match tool maintains a list of biographical catalogues, which can serve as candidate reliable sources. The table below displays an informal analysis of the catalogues available in English:

Source | URL | Comments | Candidate?
Oxford Dictionary of National Biography | [1] | subscription needed to access the full text; full version available on Wikisource | maybe
Dictionary of Welsh Biography | [2] | | Support
Dictionary of art historians | [3] | | Support
The Royal Society | [4] | few entries (< 1,600) | Support
Members of the European Parliament | [5] | almost no raw text | Oppose
Thyssen-Bornemisza museum | [6] | | Support
BBC your paintings | [7] | | Support
National Portrait Gallery | [8] | almost no raw text | Oppose
Stanford Encyclopedia of Philosophy | [9] | not only biographies (encyclopedia) | Support
A Cambridge Alumni Database | [10] | lots of abbreviations | Oppose
Australian dictionary of biographies | [11] | | Support
General Division of the Order of Australia | [12] | | Semi-structured
The Union List of Artist Names | [13] | Structured data (RDF) with public endpoint | Structured
AcademiaNet | [14] | | Semi-structured
Appletons' Cyclopædia of American Biography | [15] | from Wikisource | Support
Artsy | [16] | commercial web site, as suggested by Spinster | Oppose
British Museum | [17] | Structured data (RDF) with public endpoint | Structured
Bénézit Dictionary of Artists | [18] | subscription needed | maybe
French theatre of the seventeenth and eighteenth centuries | [19] | little data | Semi-structured
Catholic Hierarchy Bishops | [20] | microdata | Semi-structured
China Vitae | [21] | lots of items have no actual biography | Support
Cultural Objects Name Authority | [22] | | Semi-structured
Cooper Hewitt | [23] | API available (seems not to return all the text that appears in a person's page though) | Support
Design&Art Australia Online | [24] | must browse to the biography tab for full text | Support
Database of Scientific Illustrators | [25] | | Semi-structured
The Dictionary of Ulster Biography | [26] | | Support
Encyclopedia Brunoniana | [27] | not only biographies (encyclopedia) | Support
Early Modern Letters Online | [28] | no biographies | Oppose
Global Anabaptist Mennonite Encyclopedia Online | [29] | third-party wiki | Support
Genealogics person ID | [30] | secondary source resulting from personal research | Semi-structured
The Hermitage - Authors | [31] | no biographies | Oppose
LoC artists | [32] | no biographies | Oppose
MOMA | [33] | no biographies | Oppose
MSBI | [34] | short utterances (may be hard to parse) | Semi-structured
MUNKSROLL | [35] | | Support
Metallum bands | [36] | | Support
National Gallery of Art | [37] | no biographies | Oppose
National Gallery of Victoria | [38] | no biographies | Oppose
National Library of Ireland | [39] | no biographies | Oppose
Notable Names Database | [40] | | Support
Belgian people and things | [41] | no biographies | Oppose
Open Library | [42] | no biographies | Oppose
ORCID | [43] | biographies may be missing | maybe
OpenPlaques | [44] | no biographies | Oppose
Project Gutenberg | [45] | no biographies | Oppose
National Library of Australia | [46] | actually links to other catalogues | Oppose
Smithsonian American Art Museum | [47] | no biographies | Oppose
Structurae persons | [48] | | Semi-structured
Theatricalia | [49] | no biographies | Oppose
Web Gallery of Art | [50] | lots of embedded frames in pages; commercial web site, as suggested by Spinster | Oppose
Parliament UK | [51] | | Semi-structured
Catholic Encyclopedia (1913) | [52] | Wikisource; not only biographies (encyclopedia) | Support
Baker's Biographical Dictionary of Musicians | [53] | full raw text available | Support
RKDartists | [54] | suggested by Spinster | Semi-structured

The table below instead shows a list of Wikisource sources, as per wikisource:Category:Biographical_dictionaries and wikisource:Wikisource:WikiProject_Biographical_dictionaries. @Nemo bis: many thanks for suggesting the links.

Source | URL | Comments | Candidate?
Dictionary of National Biography | [55] | | Support
History of Alabama and Dictionary of Alabama Biography | [56] | almost no data (except for Montgomery) | Oppose
American Medical Biographies | [57] | | Support
A Biographical Dictionary of Ancient, Medieval, and Modern Freethinkers | [58] | everything in one page, may be tricky to parse | Support
The Dictionary of Australasian Biography | [59] | | Support
Dictionary of Christian Biography and Literature to the End of the Sixth Century | [60] | | Support
A Dictionary of Artists of the English School | [61] | quite incomplete (only A, F, K); one page per letter, may be tricky to parse | Support
A Short Biographical Dictionary of English Literature | [62] | | Support
Dictionary of Greek and Roman Biography and Mythology | [63] | | Support
The Indian Biographical Dictionary (1915) | [64] | | Support
Modern English Biography | [65] | very little data | Support
Who's Who, 1909 | [66] | 2 persons | maybe
Who Was Who (1897 to 1916) | [67] | almost nothing in Wikisource, but full text available at archive.org | Support
A Dictionary of Music and Musicians | [68] | not only biographies | Support
Men-at-the-Bar | [69] | lots of abbreviations | Support
A Naval Biographical Dictionary | [70] | | Support
Makers of British botany | [71] | few people, very long biographies | maybe
Biographies of Scientific Men | [72] | few people, very long biographies | maybe
A Chinese Biographical Dictionary | [73] | | Support
Who's Who in China (3rd edition) | [74] | | Support
A Compendium of Irish Biography | [75] | no data | Oppose
Chronicle of the law officers of Ireland | [76] | one page per chapter, may be tricky to parse | Support
A biographical dictionary of eminent Scotsmen | [77] | no data | Oppose
Dictionary of National Biography, 1901 supplement | [78] | need to check the intersection with the original one | Support
Dictionary of National Biography, 1912 supplement | [79] | need to check the intersection with the original one | Support
Woman's Who's Who of America, 1914-15 | [80] | very little data | Support
The Biographical Dictionary of America | [81], [82], [83], [84], [85], [86], [87], [88], [89], [90] | almost nothing in Wikisource, but full text available at archive.org | Support
Historical and biographical sketches | [91] | few people, very long biographies | maybe
Cartoon portraits and biographical sketches of men of the day | [92] | | Support
Men of the Time, eleventh edition | [93] | | Support

Relevant Wikidata Properties Statistics

The table below captures 3 different usage signals for Wikidata properties relevant to the biographical domain, namely the frequency ranking, the percentage of unsourced statements, and the external use.

This list of properties may serve as a valid starting point for the Candidate Relations Set milestone.

Label | ID | Frequency Ranking | Unsourced Statements % | External Use | Domain | Comments | Lexical Unit | Frame | Range FE | Candidate?
country | P17 | 04th | 48% | yes | places | unsuitable domain | | | | Oppose
sex or gender | P21 | 06th | 70% | yes | persons, animals | use semi-structured scraped data | | | | Support
date of birth | P569 | 09th | 37% | yes | person, organism | use semi-structured scraped data | bear | | | Support
given name | P735 | 10th | 92% | no | person | use semi-structured scraped data | | | | Support
occupation | P106 | 14th | 79% | yes | person | | be (very risky), work | Being_employed | Position | Support
country of citizenship | P27 | 16th | 71% | yes | person, term | | come from, originate | People_by_origin | Origin | Support
date of death | P570 | 21st | 37% | yes | person, organism | use semi-structured scraped data | die | | | Support
place of birth | P19 | 23rd | 18% | yes | person | use semi-structured scraped data | bear | | | Support
official name | P1448 | 26th | almost 0% | yes | all items | tricky one, may use semi-structured scraped data | | Being_named | Name | Support
place of death | P20 | 32nd | 35% | yes | person | use semi-structured scraped data | die | | | Support
educated at | P69 | 44th | 77% | yes | person | | study, educate, train, learn | Education_teaching | Institution | Support
languages spoken or written | P1412 | 55th | 32% | no | person | no FrameNet data, should use a custom frame | speak, write | | | Support
position held | P39 | 60th | 82% | yes | person | the mapping seems reasonable, but conflicts with P106 | work | Being_employed | Position | Support
award received | P166 | 62nd | 81% | yes | person, organization, creative work | | win | Win_prize | Prize | Support
member of political party | P102 | 67th | 60% | yes | people that are politicians | | | | | Support
family name | P734 | 68th | 70% | yes | persons | use semi-structured scraped data | | | | Support
creator | P170 | 73rd | 39% | yes | N.A. | inverse property (person is the range) | create, co-create, develop, establish, found, generate, make, produce, set up, synthesize | Intentionally_create | Created_entity (domain), Creator (range) | Support
author | P50 | 75th | 34% | yes | work | inverse property (person is the range); sub-property of creator for written works; constrain to domain type = Written work | same as creator | same as creator | same as creator | Support
director | P57 | 77th | 18% | yes | work | inverse property (person is the range); seems a sub-property of creator for motion pictures, plays, video games; constrain to domain types = (Movie, Play, VideoGame) | direct, co-direct | Behind_the_scenes | Production (domain), Artist (range) | Support
member of | P463 | 87th | 92% | yes | person | | belong | Membership | Group | Support
participant of | P1344 | 89th | 92% | no | human, group of humans, organization | | engage, participate, take part | Participation | Event | Support
employer | P108 | 97th | 92% | yes | human | | commission, employ | Employing | Employer | Support

February 2016

Week 1

  • The development corpus is already in good shape, with ca. 700,000 items scraped from 50 sources. Ca. 180,000 items contain raw text biographies to feed the NLP pipeline (cf. https://github.com/Wikidata/StrepHit/issues/13#issuecomment-185314081);
  • the corpus analysis module baseline is implemented, and currently yields ca. 20,000 verb lemmas;
  • we are checking whether the relevant Wikidata properties shown above can be triggered by our verb lemmas: if so, they will definitely serve as the first 21 candidate relations. The remaining 29 will be extracted according to the lexicographical and statistical rankings;
  • we are updating the Wikidata properties table above.

Week 2

After inspecting the set of verb lemmas, we found lots of noise, mainly caused by the default tokenization logic of the POS-tagging library we used. Therefore, we implemented our own tokenizer, to be leveraged by all modules. A second run of the corpus analysis yielded ca. 7,600 verb lemmas: fewer items, but of much higher quality.

Verb Rankings

The final output of the corpus analysis module consists of 2 rankings, one based on lexicographical evidence (i.e., TF/IDF), and one on statistical evidence (i.e., standard deviation). Cf. https://github.com/Wikidata/StrepHit/issues/5#issuecomment-188293650 for more technical details.

For each ranking, we intersected the top 50 lemmas with FrameNet data: the data can be found at https://github.com/Wikidata/StrepHit/issues/5#issuecomment-189376446
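
As an illustration of the two rankings, the sketch below scores each verb lemma by its mean TF-IDF across biographies and by the standard deviation of those per-biography scores. How the actual corpus analysis module aggregates the scores is documented in the linked issue, not here, so the function name `rank_verbs`, the averaging step, and the input shape are assumptions made for illustration only.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def rank_verbs(docs):
    """Rank verb lemmas by mean TF-IDF and by standard deviation of TF-IDF.

    `docs` is a list of documents, each given as a list of verb lemmas.
    Returns two lists of lemmas, highest score first.
    NOTE: the aggregation (mean, population std dev) is an assumption.
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each lemma occurs.
    doc_freq = Counter(lemma for doc in docs for lemma in set(doc))

    # One TF-IDF score per lemma and per document (0.0 where absent).
    scores = {lemma: [] for lemma in doc_freq}
    for doc in docs:
        counts = Counter(doc)
        for lemma in doc_freq:
            tf = counts[lemma] / len(doc) if doc else 0.0
            idf = math.log(n_docs / doc_freq[lemma])
            scores[lemma].append(tf * idf)

    # Ties are broken alphabetically to keep the output deterministic.
    by_tfidf = sorted(scores, key=lambda lemma: (-mean(scores[lemma]), lemma))
    by_stdev = sorted(scores, key=lambda lemma: (-pstdev(scores[lemma]), lemma))
    return by_tfidf, by_stdev
```

Under these assumptions, a lemma that occurs in every biography gets an IDF of zero and therefore sinks to the bottom of both rankings. The top of each list could then be intersected with the FrameNet lexical units.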

Week 3

  • We observed that the output of our scrapers may also contain semi-structured data, which can be a very valuable source of statements. Hence, we have been working on a Wikidata dataset: we plan to upload it to the primary sources tool backend instance, hosted at the Wikimedia Tool Labs, and announce it to the community;
  • the entity linking facility is implemented, and currently supports the Dandelion Entity Extraction API;
  • the crowdsourcing annotation module is underway, and currently interacts with the CrowdFlower API for posting annotation jobs and pulling results.

Week 4

  • started working on the extraction of sentences from the corpus: they will get sampled and serve as seeds for the training and test sets, as well as for the actual classification;
  • brainstorming session to come up with three extraction strategies:
    • n2n (default), i.e., many sentences per many LUs. This entails that the same sentence is likely to be extracted multiple times;
    • 121, i.e., one sentence per LU. This entails that a single sentence will be extracted only once;
    • syntactic (to be implemented), i.e., extraction based on dependency parsing. We argue that this may be useful to split long complex sentences;
  • we have contacted the primary sources tool maintainers, and asked them to grant us either access to the specific machine at Tool Labs, or a token for the /import service (undocumented in the tool homepage, but documented in Google's codebase);
  • currently, we are waiting for their answer, and will upload the semi-structured dataset as soon as we are granted access.
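
The difference between the n2n and 121 strategies listed above can be sketched as follows. This is a simplified reading, not the StrepHit implementation: the function names, the `lexical_units` mapping from a lemma to its surface forms, and the plain word matching (instead of real lemmatization) are all assumptions. In particular, 121 is interpreted here as "each LU gets at most one sentence, and each sentence is used at most once".

```python
import re


def _tokens(sentence):
    """Lowercased word tokens of a sentence (naive, for illustration)."""
    return set(re.findall(r"\w+", sentence.lower()))


def extract_n2n(sentences, lexical_units):
    """n2n: yield a (sentence, LU) pair for every LU found in every sentence.

    The same sentence is yielded once per matching LU.
    """
    for sentence in sentences:
        tokens = _tokens(sentence)
        for lu, forms in lexical_units.items():
            if tokens & forms:
                yield sentence, lu


def extract_121(sentences, lexical_units):
    """121 (assumed reading): one sentence per LU, each sentence used once."""
    taken = set()  # LUs that already have their sentence
    for sentence in sentences:
        tokens = _tokens(sentence)
        for lu, forms in lexical_units.items():
            if lu not in taken and tokens & forms:
                taken.add(lu)
                yield sentence, lu
                break  # do not reuse this sentence for another LU
```

With a sentence such as "She was born in Rome and studied in Paris.", n2n yields it twice (once for "bear", once for "study"), while 121 yields it only for the first LU that is still free.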

March 2016

Week 1

  • testing the sentence extraction module;
  • experimenting with extraction strategies:
    • basic ones (n2n, 121) are noisy;
    • syntactic is computationally intensive.
  • computing corpus statistics: sources, biography distribution;
  • working on the semi-structured dataset:
    • resolving honorifics more reliably.
  • caching facilities: general-purpose and entity linking caching.

Week 2

Week 3

  • second pull request to the primary sources tool codebase: https://github.com/google/primarysources/pull/87
  • first crowdsourcing job pilots:
    • input data created with the 3 extraction strategies;
    • failed: workers got confused by (a) overly long sentences, and (b) difficult labels returned by FrameNet.
  • working on the integration of the crowdsourcing platform CrowdFlower:
    • dynamic generation of the worker interface based on input data;
    • automate the flow via the API;
    • interacting with the CrowdFlower help desk to fix issues in the API.
  • working on the scientific article revision;
  • investigating the relevance of the FrameNet frame repository:
    • FEs will be annotated if they can be mapped to Wikidata properties, otherwise they are useless.
  • implementing a simple FE to Wikidata properties matcher:
    • functions to retrieve Wikidata property IDs, full entity metadata, and labels and aliases only;
    • extract FEs only if they map to Wikidata properties via exact matching of labels and aliases.
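
The exact-matching idea behind the FE to Wikidata properties matcher can be sketched as below. The property table is passed in as a plain dictionary rather than fetched from Wikidata, the function names are hypothetical, and the case-insensitive comparison with underscores turned into spaces is an assumption about what "exact matching" tolerates.

```python
def build_index(properties):
    """Map every normalized label or alias to the property IDs that use it.

    `properties` maps a property ID to {"label": str, "aliases": [str, ...]}.
    """
    index = {}
    for pid, names in properties.items():
        for name in [names["label"], *names["aliases"]]:
            index.setdefault(name.casefold(), set()).add(pid)
    return index


def match_fe(fe_name, index):
    """Return the sorted property IDs whose label or alias equals the FE name.

    FrameNet FE names such as "Created_entity" use underscores, so they are
    converted to spaces before the lookup (an assumption of this sketch).
    """
    key = fe_name.replace("_", " ").casefold()
    return sorted(index.get(key, ()))
```

An FE with no match yields an empty list, which corresponds to the FE being skipped at this stage.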

Week 4

  • working on the simplification of the annotation jobs:
    • filter numerical FEs, which should instead be classified directly with a rule-based strategy;
    • let extra FEs be annotated too, not only core ones;
    • FEs with no mapping to Wikidata should not be skipped; rather, their labels should be made understandable.
  • testing and documenting parallel processing facilities;
  • new sentence extraction strategy: grammar-based.

April 2016

Week 1

  • midpoint report written;
  • working on the technical documentation;
  • corpus stats: length distribution of biographies;
  • working on the scientific article revision.

Week 2

  • almost full-time work on the scientific article revision;
  • dissemination: participated in the HackAtoka hackathon at SpazioDati, Trento:
    • implemented a rule-based classifier for the companies domain.

Week 3

Week 4

May 2016

Week 1

Week 2

  • bug fixing based on the SOD hackathon feedback (cf. #Week_4_3):
    • DSI and rkd.nl sources scraping;
    • fixed a wrong Wikidata property mapping with possibly high impact;
  • work on the normalization of numerical expressions (typically dates):
    • regular expressions to capture them;
    • transformation rules to fit the Wikidata data types;
    • tests;
  • first version of the supervised classifier.
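
The date normalization steps above (a regular expression to capture the expression, then a transformation rule towards the Wikidata time data type) might look like the following minimal sketch. It only handles English "day month year" patterns, and it assumes the target is the `+YYYY-MM-DDT00:00:00Z/precision` notation, where precision 9, 10 and 11 stand for year, month and day. The names `DATE_RE` and `normalize_date` are hypothetical, and the actual StrepHit rules are likely broader.

```python
import re

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Optional "day month" or "month" prefix, followed by a 4-digit year.
DATE_RE = re.compile(
    r"\b(?:(?:(?P<day>\d{1,2})\s+)?(?P<month>" + "|".join(MONTHS) + r")\s+)?"
    r"(?P<year>\d{4})\b",
    re.IGNORECASE,
)


def normalize_date(text):
    """Return the first date in `text` as a Wikidata-style time string.

    Returns None when no date is found. Missing month or day parts are
    written as 00 and reflected in the precision suffix.
    """
    match = DATE_RE.search(text)
    if match is None:
        return None
    year = int(match["year"])
    month = MONTHS.get((match["month"] or "").lower(), 0)
    day = int(match["day"] or 0)
    precision = 11 if day else 10 if month else 9
    return f"+{year:04d}-{month:02d}-{day:02d}T00:00:00Z/{precision}"
```

For instance, "born 15 March 1900 in Rome" is normalized with day precision, while "died 1900" falls back to year precision.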

Week 3

Week 4

  • major unplanned outcome: entities that could not be resolved to Wikidata IDs during dataset serialization may serve as new Wikidata Items. Action: the final list of unresolved entities will be proposed to the community;
  • entities that are places should not undergo the annotation, but rather be directly classified;
  • lexical database improvements:
    • FE-to-Wikidata property mappings;
    • marked FEs that should become the subjects of the output statement;
  • plug a gazetteer as an extra set of features for the supervised classifier;
  • resolving countries of citizenship from nationalities;
  • prepare the StrepHit pipeline 1.0 beta release.

June 2016

Week 1

  • StrepHit pipeline version 1.0 Beta released and announced: https://github.com/Wikidata/StrepHit/releases/tag/1.0-beta
  • the ULAN scraper now harvests URLs to human-readable resources;
  • support for multiple subjects in the dataset serialization;
  • improvements to the Sphinx Wikitext documentation extension;
  • automatic model selection to pick the best model for the supervised classifiers.

Week 2

Week 3

  • Parametrizable script for supervised training;
  • stopwords should not be features;
  • normalization of names for better entity resolution;
  • optional feature reduction facility;
  • major change in entity resolution: resolve Wikidata QIDs by looking up the URIs of linked entities;
  • skipping non-linked chunks in feature extraction;
  • optional K-fold validation in training script;
  • handle qualifiers at dataset serialization.

Week 4

  • Do not serialize frame elements that have a wrong class;
  • performance evaluation of supervised classifiers:
    • 10-fold cross validation;
    • comparison with dummy classifiers;
    • accuracy values against a gold standard of 249 fully annotated sentences;
  • correctly handling places;
  • StrepHit pipeline version 1.1 Beta released and announced: https://github.com/Wikidata/StrepHit/releases/tag/1.1-beta
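
The evaluation protocol mentioned above (k-fold cross validation and comparison with a dummy classifier) can be illustrated with a dependency-free sketch. It is not the StrepHit evaluation code: the helper names, the contiguous fold layout, and the majority-class baseline as the "dummy" are assumptions. Any trainer with the same `train(features, labels) -> predict` shape could be plugged in for comparison.

```python
from collections import Counter


def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def majority_trainer(features, labels):
    """Dummy baseline: always predict the most frequent training label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda _item: most_common


def cross_validate(train, features, labels, k=10):
    """Return the mean accuracy of `train` over k folds.

    `train(features, labels)` must return a `predict(item)` callable.
    Empty folds (when there are fewer items than folds) are skipped.
    """
    accuracies = []
    for fold in k_fold_indices(len(labels), k):
        if not fold:
            continue
        held_out = set(fold)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        predict = train(train_x, train_y)
        correct = sum(predict(features[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)
```

A supervised classifier is only worth keeping if its cross-validated accuracy clearly exceeds what `majority_trainer` reaches on the same folds. With 249 sentences and 10 folds, this layout gives nine folds of 25 items and one of 24.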