Grants:IEG/StrepHit: Wikidata Statements Validation via References/Timeline

This project is funded by an Individual Engagement Grant

This Individual Engagement Grant is renewed

Timeline for StrepHit: Wikidata Statements Validation via References

Timeline	Date
Development Corpus	April 11 2016
Candidate Relations Set	April 11 2016
StrepHit Pipeline Beta	June 11 2016
Production Corpus	July 11 2016
Web Sources Knowledge Base	July 11 2016

Overview

Project start date: January 11, 2016
Codebase: https://github.com/Wikidata/StrepHit
Documentation: https://www.mediawiki.org/wiki/StrepHit

Monthly updates

Each update will cover a 1-month time span, starting from the 11th day. For instance, January 2016 means January 11th to February 11th 2016.

January 2016

Dissemination activities

Jan 15: Kick-off seminar at FBK, Trento, Italy

Jan 20: Talk at the event Web 3.0, il potenziale del web semantico e dei dati strutturati, Lugano, Switzerland

Event page (in Italian): http://www.ated.ch/manifestazioni/7/web-30-il-potenziale-del-web-semantico-e-dei-dati-strutturati_3194.html

Sources identification

We identified 3 candidate domains that may serve as good use cases for the project:

Biographies
Companies
Biomedical literature

Domain	Reasons
Biographies	plenty of existing data; broad coverage; potentially easy to find valuable primary sources; perfect fit for the current prototype
Companies	relatively biased domain; ad-prone content; the company edits the page on the company itself; low-quality data
Biomedical	great primary source, i.e., PubMed; proof of usage for an Open Access corpus; complex implementation

We have gathered feedback from different communities (Wikimedians following our seminars, GLAM), which seem by far to prefer the biographical domain. Hence, we have selected it and harvested the list of primary sources.

Biographies

The mix'n'match tool maintains a list of biographical catalogues, which can serve as candidate reliable sources. The table below displays an informal analysis of the catalogues available in English:

Source	URL	Comments	Candidate?
Oxford Dictionary of National Biography	[1]	subscription needed to access the full text; full version available on Wikisource	maybe
Dictionary of Welsh Biography	[2]		Support
Dictionary of art historians	[3]		Support
The Royal Society	[4]	few entries (< 1,600)	Support
Members of the European Parliament	[5]	almost no raw text	Oppose
Thyssen-Bornemisza museum	[6]		Support
BBC your paintings	[7]		Support
National Portrait Gallery	[8]	almost no raw text	Oppose
Stanford Encyclopedia of Philosophy	[9]	not only biographies (encyclopedia)	Support
A Cambridge Alumni Database	[10]	lots of abbreviations	Oppose
Australian dictionary of biographies	[11]		Support
General Division of the Order of Australia	[12]		Semi-structured
The Union List of Artist Names	[13]	Structured data (RDF) with public endpoint	Structured
AcademiaNet	[14]		Semi-structured
Appletons' Cyclopædia of American Biography	[15]	from Wikisource	Support
Artsy	[16]	commercial web site, as suggested by Spinster	Oppose
British Museum	[17]	Structured data (RDF) with public endpoint	Structured
Bénézit Dictionary of Artists	[18]	subscription needed	maybe
French theatre of the seventeenth and eighteenth centuries	[19]	few data	Semi-structured
Catholic Hierarchy Bishops	[20]	microdata	Semi-structured
China Vitae	[21]	lots of items have no actual biography	Support
Cultural Objects Name Authority	[22]		Semi-structured
Cooper Hewitt	[23]	API available (seems not to return all the text that appears in a person's page though)	Support
Design&Art Australia Online	[24]	must browse to the biography tab for full text	Support
Database of Scientific Illustrators	[25]		Semi-structured
The Dictionary of Ulster Biography	[26]		Support
Encyclopedia Brunoniana	[27]	not only biographies (encyclopedia)	Support
Early Modern Letters Online	[28]	no biographies	Oppose
Global Anabaptist Mennonite Encyclopedia Online	[29]	third-party wiki	Support
Genealogics person ID	[30]	secondary source resulting from personal research	Semi-structured
The Hermitage - Authors	[31]	no biographies	Oppose
LoC artists	[32]	no biographies	Oppose
MOMA	[33]	no biographies	Oppose
MSBI	[34]	short utterances (may be hard to parse)	Semi-structured
MUNKSROLL	[35]		Support
Metallum bands	[36]		Support
National Gallery of Art	[37]	no biographies	Oppose
National Gallery of Victoria	[38]	no biographies	Oppose
National Library of Ireland	[39]	no biographies	Oppose
Notable Names Database	[40]		Support
Belgian people and things	[41]	no biographies	Oppose
Open Library	[42]	no biographies	Oppose
ORCID	[43]	biographies may be missing	maybe
OpenPlaques	[44]	no biographies	Oppose
Project Gutenberg	[45]	no biographies	Oppose
National Library of Australia	[46]	actually links to other catalogues	Oppose
Smithsonian American Art Museum	[47]	no biographies	Oppose
Structurae persons	[48]		Semi-structured
Theatricalia	[49]	no biographies	Oppose
Web Gallery of Art	[50]	lots of embedded frames in pages; commercial web site, as suggested by Spinster	Oppose
Parliament UK	[51]		Semi-structured
Catholic Encyclopedia (1913)	[52]	Wikisource; not only biographies (encyclopedia)	Support
Baker's Biographical Dictionary of Musicians	[53]	full raw text available	Support
RKDartists	[54]	suggested by Spinster	Semi-structured

The table below instead shows a list of Wikisource sources, as per wikisource:Category:Biographical_dictionaries and wikisource:Wikisource:WikiProject_Biographical_dictionaries. @Nemo bis: many thanks for suggesting the links.

Source	URL	Comments	Candidate?
Dictionary of National Biography	[55]		Support
History of Alabama and Dictionary of Alabama Biography	[56]	almost no data (except for Montgomery)	Oppose
American Medical Biographies	[57]		Support
A Biographical Dictionary of Ancient, Medieval, and Modern Freethinkers	[58]	everything in one page, may be tricky to parse	Support
The Dictionary of Australasian Biography	[59]		Support
Dictionary of Christian Biography and Literature to the End of the Sixth Century	[60]		Support
A Dictionary of Artists of the English School	[61]	quite incomplete (only A, F, K); one page per letter, may be tricky to parse	Support
A Short Biographical Dictionary of English Literature	[62]		Support
Dictionary of Greek and Roman Biography and Mythology	[63]		Support
The Indian Biographical Dictionary (1915)	[64]		Support
Modern English Biography	[65]	really few data	Support
Who's Who, 1909	[66]	2 persons	maybe
Who Was Who (1897 to 1916)	[67]	almost nothing in Wikisource, but full text available at archive.org	Support
A Dictionary of Music and Musicians	[68]	not only biographies	Support
Men-at-the-Bar	[69]	lots of abbreviations	Support
A Naval Biographical Dictionary	[70]		Support
Makers of British botany	[71]	few people, very long biographies	maybe
Biographies of Scientific Men	[72]	few people, very long biographies	maybe
A Chinese Biographical Dictionary	[73]		Support
Who's Who in China (3rd edition)	[74]		Support
A Compendium of Irish Biography	[75]	no data	Oppose
Chronicle of the law officers of Ireland	[76]	one page per chapter, may be tricky to parse	Support
A biographical dictionary of eminent Scotsmen	[77]	no data	Oppose
Dictionary of National Biography, 1901 supplement	[78]	need to check the intersection with the original one	Support
Dictionary of National Biography, 1912 supplement	[79]	need to check the intersection with the original one	Support
Woman's Who's Who of America, 1914-15	[80]	really few data	Support
The Biographical Dictionary of America	[81], [82], [83], [84], [85], [86], [87], [88], [89], [90]	almost nothing in Wikisource, but full text available at archive.org	Support
Historical and biographical sketches	[91]	few people, very long biographies	maybe
Cartoon portraits and biographical sketches of men of the day	[92]		Support
Men of the Time, eleventh edition	[93]		Support

Relevant Wikidata Properties Statistics

The table below captures 3 different usage signals of Wikidata properties relevant to the biographical domain, namely:

frequency ranking in Items, as per d:Wikidata:Database_reports/List_of_properties/Top100;
percentage of unsourced statements, as per http://tools.wmflabs.org/wd-analyst/ ;
external use, as per d:Special:WhatLinksHere/Template:ExternalUse

This list of properties may serve as a valid starting point for the Candidate Relations Set milestone.

Label	ID	Frequency Ranking	Unsourced Statements %	External Use	Domain	Comments	Lexical Unit	Frame	Range FE	Candidate?
country	P17	04th	48%	yes	places	unsuitable domain				Oppose
sex or gender	P21	06th	70%	yes	persons, animals	use semi-structured scraped data				Support
date of birth	P569	09th	37%	yes	person, organism	use semi-structured scraped data	`bear`			Support
given name	P735	10th	92%	no	person	use semi-structured scraped data				Support
occupation	P106	14th	79%	yes	person		`be` (very risky) `work`	`Being_employed`	`Position`	Support
country of citizenship	P27	16th	71%	yes	person, term		`come from` `originate`	`People_by_origin`	`Origin`	Support
date of death	P570	21st	37%	yes	person, organism	use semi-structured scraped data	`die`			Support
place of birth	P19	23rd	18%	yes	person	use semi-structured scraped data	`bear`			Support
official name	P1448	26th	almost 0%	yes	all items	tricky one, may use semi-structured scraped data		`Being_named`	`Name`	Support
place of death	P20	32nd	35%	yes	person	use semi-structured scraped data	`die`			Support
educated at	P69	44th	77%	yes	person		`study` `educate` `train` `learn`	`Education_teaching`	`Institution`	Support
languages spoken or written	P1412	55th	32%	no	person	no FrameNet data, should use a custom frame	`speak` `write`			Support
position held	P39	60th	82%	yes	person	the mapping seems reasonable, but conflicts with P106	`work`	`Being_employed`	`Position`	Support
award received	P166	62nd	81%	yes	person, organization, creative work		`win`	`Win_prize`	`Prize`	Support
member of political party	P102	67th	60%	yes	people that are politicians					Support
family name	P734	68th	70%	yes	persons	use semi-structured scraped data				Support
creator	P170	73rd	39%	yes	N.A.	inverse property (person is the range)	`create` `co-create` `develop` `establish` `found` `generate` `make` `produce` `set up` `synthesize`	`Intentionally_create`	`Created_entity` (domain) `Creator` (range)	Support
author	P50	75th	34%	yes	work	inverse property (person is the range); sub-property of creator for written works; constrain to domain type = Written work	same as creator	same as creator	same as creator	Support
director	P57	77th	18%	yes	work	inverse property (person is the range); seems a sub-property of creator for motion pictures, plays, video games; constrain to domain types = (Movie, Play, VideoGame)	`direct` `co-direct`	`Behind_the_scenes`	`Production` (domain) `Artist` (range)	Support
member of	P463	87th	92%	yes	person		`belong`	`Membership`	`Group`	Support
participant of	P1344	89th	92%	no	human, group of humans, organization		`engage` `participate` `take part`	`Participation`	`Event`	Support
employer	P108	97th	92%	yes	human		`commission` `employ`	`Employing`	`Employer`	Support

February 2016

Week 1

The development corpus is already in good shape, with ca. 700,000 items scraped from 50 sources. Ca. 180,000 items contain raw text biographies to feed the NLP pipeline (cf. https://github.com/Wikidata/StrepHit/issues/13#issuecomment-185314081);
the corpus analysis module baseline is implemented, and currently yields ca. 20,000 verb lemmas;
we are checking whether the relevant Wikidata properties shown above can be triggered by our verb lemmas: if so, they will definitely serve as the first 21 candidate relations. The remaining 29 will be extracted according to the lexicographical and statistical rankings;
updating the Wikidata properties table above;

Week 2

After inspecting the set of verb lemmas, we found lots of noise, mainly caused by the default tokenization logic of the POS-tagging library we used. Therefore, we implemented our own tokenizer, to be leveraged by all modules. A second run of the corpus analysis yielded ca. 7,600 verb lemmas: fewer items of much higher quality.

Verb Rankings

The final output of the corpus analysis module consists of 2 rankings, one based on lexicographical evidence (i.e., TF/IDF), and one on statistical evidence (i.e., standard deviation). Cf. https://github.com/Wikidata/StrepHit/issues/5#issuecomment-188293650 for more technical details.
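As a rough illustration of the two rankings, here is a minimal sketch. It assumes that the lexicographical ranking orders lemmas by their mean TF/IDF over all biographies, and that the statistical ranking orders them by the standard deviation of the same per-biography scores. Both readings, as well as the function name `rank_lemmas`, are assumptions made for this example; the linked issue describes the actual computation.

```python
import math
from collections import Counter
from statistics import pstdev


def rank_lemmas(docs):
    """docs: one list of verb lemmas per biography.

    Returns two lists of lemmas: ranked by mean TF/IDF and by the
    standard deviation of the per-biography TF/IDF scores.
    """
    n_docs = len(docs)
    # Document frequency: number of biographies containing each lemma
    df = Counter(lemma for doc in docs for lemma in set(doc))
    scores = {lemma: [] for lemma in df}
    for doc in docs:
        tf = Counter(doc)
        for lemma in df:
            idf = math.log(n_docs / df[lemma])
            # Lemmas absent from a biography score 0 there
            scores[lemma].append(tf[lemma] / len(doc) * idf if doc else 0.0)
    by_tfidf = sorted(scores, key=lambda l: sum(scores[l]) / n_docs, reverse=True)
    by_stdev = sorted(scores, key=lambda l: pstdev(scores[l]), reverse=True)
    return by_tfidf, by_stdev
```

A lemma occurring in every biography (such as `bear` in a toy corpus where everyone's birth is mentioned) gets an IDF of 0 and ends up at the bottom of both rankings.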

For each ranking, we intersected the top 50 lemmas with FrameNet data: the data can be found at https://github.com/Wikidata/StrepHit/issues/5#issuecomment-189376446

Week 3

We observed that our scrapers may also harvest semi-structured data, which can be a very valuable source of statements. Hence, we have been working on a Wikidata dataset: we plan to upload it to the primary sources tool backend instance, hosted at the Wikimedia Tool Labs, and announce it to the community;
the entity linking facility is implemented, and currently supports the Dandelion Entity Extraction API;
the crowdsourcing annotation module gets underway, and currently interacts with the CrowdFlower API for posting annotation jobs and pulling results;

Week 4

started working on the extraction of sentences from the corpus: they will get sampled and serve as seeds for the training and test sets, as well as for the actual classification;
brainstorming session to come up with three extraction strategies:

n2n (default), i.e., many sentences per many LUs. This entails that the same sentence is likely to be extracted multiple times;
121, i.e., one sentence per LU. This entails that a single sentence will be extracted only once;
syntactic (to be implemented), i.e., extraction based on dependency parsing. We argue that this may be useful to split long complex sentences.
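The difference between the two basic strategies can be sketched as follows. This is a simplified illustration, not the project's implementation: sentences are assumed to be already lemmatized token lists, the function names are hypothetical, and 121 is read here as "each sentence is extracted at most once, paired with the first matching LU".

```python
def extract_n2n(sentences, lexical_units):
    """Many-to-many: yield every (sentence, LU) pair, so a sentence
    containing several LUs is extracted several times."""
    for lemmas in sentences:
        for lu in lexical_units:
            if lu in lemmas:
                yield lemmas, lu


def extract_121(sentences, lexical_units):
    """One-to-one: yield each sentence at most once, paired with the
    first LU (in the given order) that it contains."""
    for lemmas in sentences:
        for lu in lexical_units:
            if lu in lemmas:
                yield lemmas, lu
                break
```

With a sentence containing both `study` and `work`, n2n produces two seeds for that sentence, while 121 produces only one.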

We have contacted the primary sources tool maintainers, and asked them to grant us either access to the specific machine at Tool Labs, or a token for the /import service (undocumented in the tool homepage, but documented in Google's codebase);
currently, we are waiting for their answer, and will upload the semi-structured dataset as soon as we are granted access;

March 2016

Week 1

testing the sentence extraction module;
experimenting with extraction strategies:
- basic ones (n2n, 121) are noisy;
- syntactic is computationally intensive.
computing corpus statistics: sources, biography distribution;
working on the semi-structured dataset:
- resolving honorifics more reliably.
caching facilities: general-purpose and entity linking caching.

Week 2

got access to the Tool Labs machine serving the primary sources tool;
interacting with the tool maintainers to understand the architecture;
working on the semi-structured dataset:
- improved named entity resolution;
- data serialization;
- date handling.
refactoring the sentence extraction module: parallel processing;
refactoring the POS tagger interface;
English-specific issue during sentence extraction: verb base forms may be ambiguous with nouns, so ensure that only verbs are extracted through POS tags;
before uploading the semi-structured dataset, we need to guarantee that all the specific measures of success (cf. Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#Measures_of_Success) are technically implemented:
- first pull request to the primary sources tool codebase: https://github.com/google/primarysources/pull/86
dissemination: working on the revision of a scientific article submitted to the Semantic Web Journal: http://semantic-web-journal.org/content/n-ary-relation-extraction-joint-t-box-and-box-knowledge-base-augmentation
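The noun/verb ambiguity noted above for sentence extraction can be handled with a simple filter on POS tags. The sketch below assumes the sentence has already been tagged with Penn Treebank tags (where all verb tags start with `VB`) and lemmatized; the function name `verb_matches` is hypothetical.

```python
def verb_matches(tagged_tokens, lexical_units):
    """Return the LUs that occur as verbs in a POS-tagged sentence.

    tagged_tokens: list of (lemma, tag) pairs with Penn Treebank tags.
    """
    return [lemma for lemma, tag in tagged_tokens
            if lemma in lexical_units and tag.startswith("VB")]


# "His work was praised": 'work' is a noun here, so it is not extracted
noun_use = [("his", "PRP$"), ("work", "NN"), ("be", "VBD"), ("praise", "VBN")]
# "They work in Rome": 'work' is a verb here, so it is extracted
verb_use = [("they", "PRP"), ("work", "VBP"), ("in", "IN"), ("Rome", "NNP")]
```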

Week 3

second pull request to the primary sources tool codebase: https://github.com/google/primarysources/pull/87
first crowdsourcing job pilots:
- input data created with the 3 extraction strategies;
- failed: workers get confused by (a) sentences that are too long, and (b) difficult labels returned by FrameNet.
working on the integration of the crowdsourcing platform CrowdFlower:
- dynamic generation of the worker interface based on input data;
- automate the flow via the API;
- interacting with the CrowdFlower help desk to fix issues in the API.
working on the scientific article revision;
investigating the relevance of the FrameNet frame repository:
- FEs will be annotated if they can be mapped to Wikidata properties, otherwise they are useless.
implementing a simple FE to Wikidata properties matcher:
- functions to retrieve Wikidata property IDs, full entity metadata, and labels and aliases only;
- extract FEs only if they map to Wikidata properties via exact matching of labels and aliases.
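A minimal sketch of such an exact matcher is shown below. It works on an in-memory dictionary rather than on the live Wikidata API, and the function names and the sample aliases are hypothetical: only the property IDs and labels are taken from the properties table above.

```python
def build_index(properties):
    """properties: dict mapping a property ID to its label and aliases.

    Returns a dict from lowercased name to the set of matching property IDs.
    """
    index = {}
    for pid, names in properties.items():
        for name in names:
            index.setdefault(name.lower(), set()).add(pid)
    return index


def match_fe(fe_label, index):
    """Exact, case-insensitive match of an FE label against labels and aliases.

    FrameNet FE names use underscores, so these are turned into spaces first.
    """
    return sorted(index.get(fe_label.replace("_", " ").lower(), set()))


# Illustrative sample: labels come from the table above, aliases are made up
SAMPLE_PROPERTIES = {
    "P108": ["employer", "employed by"],
    "P166": ["award received", "prize"],
}
```

With this sample, `match_fe("Employer", build_index(SAMPLE_PROPERTIES))` yields `["P108"]`, while an FE such as `Time` yields an empty list and would not be extracted.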

Week 4

working on the simplification of the annotation jobs:
- filter numerical FEs, which should instead be classified directly with a rule-based strategy;
- let extra FEs be annotated too, not only core ones;
- FEs with no mapping to Wikidata should not be skipped; rather, their labels should be made understandable.
testing and documenting parallel processing facilities;
new sentence extraction strategy: grammar-based.

April 2016

Week 1

midpoint report written;
working on the technical documentation:
- generate Sphinx-based Python docs for MediaWiki;
- pull request to Sphinx for handling Wiki Syntax: https://github.com/sphinx-doc/sphinx/pull/2444
corpus stats: length distribution of biographies;
working on the scientific article revision.

Week 2

quasi full time work on the scientific article revision;
dissemination: participating in the HackAtoka hackathon at SpazioDati, Trento
- implemented a rule-based classifier for the companies domain.

Week 3

full time work on the scientific article revision;
revision submitted to the Semantic Web Journal: http://semantic-web-journal.org/content/n-ary-relation-extraction-simultaneous-t-box-and-box-knowledge-base-augmentation

Week 4

full time work on the primary sources tool:
- major refactoring by the main primary sources tool backend maintainer seems to have broken our previous pull request: this is preventing us from gathering quantitative measures of success;
- fixed via https://github.com/google/primarysources/pull/100;
- enabled dataset-specific top users statistics via https://github.com/google/primarysources/pull/102;
held a hackathon at Spaghetti Open Data Reunion to test the primary sources tool with the semi-structured dataset: http://www.spaghettiopendata.org/content/wikidata-la-banca-di-conoscenza-libera-casa-wikimedia
- Italian Wikimedians attended the hackathon and helped us a lot. @CristianCantoro, Laurentius, Jaqen, and CristianNX: thank you all for your precious work!
- gathered particularly valuable feedback on both the tool and the dataset, and opened issues accordingly:
  - tool usability [94], [95], [96], [97], [98]
  - StrepHit semi-structured dataset [99], [100], [101]

May 2016

Week 1

side project: Wikimedia Italy asked us to take first steps towards the integration of a Wiki Loves Monuments Italy dataset in Wikidata;
- cf. wikidata:Wikidata:Project_chat/Archive/2016/06#Importing_Wiki_Loves_Monuments_lists_into_Wikidata;
- Download the prototype data;
focus on crowdsourcing the annotation job to build the training set:
- tested the input data;
- improved the worker interface, i.e., more detailed instructions, display the LU instead of the frame;
noticed that certain sources (e.g., ULAN) already have dedicated Wikidata properties for their IDs: added mappings;
further reliable sources in English:
first version of a rule-based classification approach.

Week 2

bug fixing based on the SOD hackathon feedback (cf. April 2016, Week 4):
- DSI and rkd.nl sources scraping;
- changed wrong Wikidata property mapping with possibly high impact;
work on the numerical expressions (typically dates) normalization:
- regular expressions to capture them;
- transformation rules to fit the Wikidata data types;
- tests;
first version of the supervised classifier.
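The normalization of date expressions mentioned above (capture with regular expressions, then transform to fit the Wikidata data types) can be sketched as follows. The two patterns, the function name, and the output layout are assumptions for this example: the output follows the QuickStatements-style time notation, i.e., a timestamp followed by `/precision`, with 11 for day and 9 for year precision.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Assumed patterns: "11 March 1952" and a bare 4-digit year
FULL_DATE = re.compile(r"\b(\d{1,2}) ([A-Za-z]+) (\d{4})\b")
YEAR_ONLY = re.compile(r"\b(\d{4})\b")


def normalize_date(text):
    """Return a time value string for the first date found, or None."""
    match = FULL_DATE.search(text)
    if match and match.group(2).lower() in MONTHS:
        day = int(match.group(1))
        month = MONTHS[match.group(2).lower()]
        year = int(match.group(3))
        # Precision 11 = day
        return "+%04d-%02d-%02dT00:00:00Z/11" % (year, month, day)
    match = YEAR_ONLY.search(text)
    if match:
        # Precision 9 = year; month and day are left as 00
        return "+%04d-00-00T00:00:00Z/9" % int(match.group(1))
    return None
```

For instance, `normalize_date("born on 11 March 1952 in Cambridge")` returns `+1952-03-11T00:00:00Z/11`.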

Week 3

attended and actively participated in WikiCite_2016:
- submitted a proposal: WikiCite_2016/Proposals/Generation_of_referenced_Wikidata_statements_with_StrepHit;
- established and led work group 4: WikiCite_2016/Report/Group_4;
- engaged potential data donors for the primary sources tool;
- opened a Request for Comments to centralize feedback on the primary sources tool and the StrepHit dataset: wikidata:Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements;
resolve family tree in genealogics.org;
first version of the dataset serializer;
improved documentation:
- Sphinx extension to handle the MediaWiki syntax;
parallel entity linking;
integration of WikiCite_2016 feedback.

Week 4

major unplanned outcome: entities that could not be resolved to Wikidata IDs during dataset serialization may serve as new Wikidata Items.

Action: the final list of unresolved entities will be proposed to the community;

entities that are places should not undergo the annotation, but rather be directly classified;
lexical database improvements:
- FE-to-Wikidata property mappings;
- marked FEs that should become the subjects of the output statement;
plug a gazetteer as an extra set of features for the supervised classifier;
resolving countries of citizenship from nationalities;
prepare the StrepHit pipeline 1.0 beta release.

June 2016

Week 1

StrepHit pipeline version 1.0 Beta released and announced: https://github.com/Wikidata/StrepHit/releases/tag/1.0-beta
the ULAN scraper now harvests URLs to human-readable resources;
support for multiple subjects in the dataset serialization;
improvements to the Sphinx Wikitext documentation extension;
automatic model selection to pick the best model for the supervised classifiers.

Week 2

High priority to the supervised classifiers:
- experiments on word embedding approaches;
- included lexical units as features;
- use one classifier per lexical unit;
attended Wikimania 2016: https://wikimania2016.wikimedia.org
- presented a poster: https://wikimania2016.wikimedia.org/wiki/Posters#StrepHit

Week 3

Parametrizable script for supervised training;
stopwords should not be features;
normalization of names for better entity resolution;
optional feature reduction facility;
major change in entity resolution: resolve Wikidata QIDs by looking up linked entities URIs;
skipping non-linked chunks in feature extraction;
optional K-fold validation in training script;
handle qualifiers at dataset serialization.

Week 4

Do not serialize frame elements that have a wrong class;
performance evaluation of supervised classifiers:
- 10-fold cross validation;
- comparison with dummy classifiers;
- accuracy values against a gold standard of 249 fully annotated sentences;
correctly handling places;
StrepHit pipeline version 1.1 Beta released and announced: https://github.com/Wikidata/StrepHit/releases/tag/1.1-beta
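The evaluation protocol mentioned in Week 4 (k-fold cross validation, with a dummy classifier as the baseline to beat) can be sketched in plain Python as follows. This is an illustration only: all names are hypothetical, folds are contiguous and unshuffled for simplicity, and the dummy baseline shown is a majority-class predictor, which is one common choice among several.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most 1."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(features, labels, train_fn, k=10):
    """Mean accuracy over k folds (requires at least k examples).

    train_fn(X, y) must return a predict(x) function.
    """
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        predict = train_fn(train_x, train_y)
        correct = sum(predict(features[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def train_majority(train_x, train_y):
    """Dummy baseline: always predict the most frequent training label."""
    majority = Counter(train_y).most_common(1)[0][0]
    return lambda x: majority
```

A real classifier is plugged in through the same `train_fn` interface, and its mean accuracy is then compared with the one obtained by `train_majority`. With 249 examples and `k=10`, this split gives nine folds of 25 sentences and one of 24.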