Grants:IEG/StrepHit: Wikidata Statements Validation via References/Timeline
This project is funded by an Individual Engagement Grant
This Individual Engagement Grant is renewed
Timeline for StrepHit: Wikidata Statements Validation via References
Milestone | Date |
---|---|
Development Corpus | April 11 2016 |
Candidate Relations Set | April 11 2016 |
StrepHit Pipeline Beta | June 11 2016 |
Production Corpus | July 11 2016 |
Web Sources Knowledge Base | July 11 2016 |
Overview
- Project start date: January 11, 2016
- Codebase: https://github.com/Wikidata/StrepHit
- Documentation: https://www.mediawiki.org/wiki/StrepHit
Monthly updates
Each update will cover a 1-month time span, starting from the 11th day of the month. For instance, January 2016 means January 11th to February 11th, 2016.
January 2016
Dissemination activities
- Jan 15: Kick-off seminar at FBK, Trento, Italy
- Jan 20: Talk at the event Web 3.0, il potenziale del web semantico e dei dati strutturati (Web 3.0, the potential of the semantic web and structured data), Lugano, Switzerland
Source identification
We identified 3 candidate domains that may serve as good use cases for the project:
- Biographies
- Companies
- Biomedical literature
Domain | Reasons |
---|---|
Biographies | |
Companies | |
Biomedical | |
We gathered feedback from different communities (Wikimedians following our seminars, GLAM institutions), which by far preferred the biographical domain. Hence, we selected it and harvested the list of primary sources.
Biographies
The mix'n'match tool maintains a list of biographical catalogues, which can serve as reliable source candidates. The table below displays an informal analysis of the catalogues available in English:
Source | URL | Comments | Candidate? |
---|---|---|---|
Oxford Dictionary of National Biography | [1] | subscription needed to access the full text; full version available on Wikisource | maybe |
Dictionary of Welsh Biography | [2] | | Support |
Dictionary of art historians | [3] | | Support |
The Royal Society | [4] | few entries (< 1,600) | Support |
Members of the European Parliament | [5] | almost no raw text | Oppose |
Thyssen-Bornemisza museum | [6] | | Support |
BBC your paintings | [7] | | Support |
National Portrait Gallery | [8] | almost no raw text | Oppose |
Stanford Encyclopedia of Philosophy | [9] | not only biographies (encyclopedia) | Support |
A Cambridge Alumni Database | [10] | lots of abbreviations | Oppose |
Australian dictionary of biographies | [11] | | Support |
General Division of the Order of Australia | [12] | | Semi-structured |
The Union List of Artist Names | [13] | Structured data (RDF) with public endpoint | Structured |
AcademiaNet | [14] | | Semi-structured |
Appletons' Cyclopædia of American Biography | [15] | from Wikisource | Support |
Artsy | [16] | commercial web site, as suggested by Spinster | Oppose |
British Museum | [17] | Structured data (RDF) with public endpoint | Structured |
Bénézit Dictionary of Artists | [18] | subscription needed | maybe |
French theatre of the seventeenth and eighteenth centuries | [19] | little data | Semi-structured |
Catholic Hierarchy Bishops | [20] | microdata | Semi-structured |
China Vitae | [21] | lots of items have no actual biography | Support |
Cultural Objects Name Authority | [22] | | Semi-structured |
Cooper Hewitt | [23] | API available (though it seems not to return all the text that appears on a person's page) | Support |
Design&Art Australia Online | [24] | must browse to the biography tab for the full text | Support |
Database of Scientific Illustrators | [25] | | Semi-structured |
The Dictionary of Ulster Biography | [26] | | Support |
Encyclopedia Brunoniana | [27] | not only biographies (encyclopedia) | Support |
Early Modern Letters Online | [28] | no biographies | Oppose |
Global Anabaptist Mennonite Encyclopedia Online | [29] | third-party wiki | Support |
Genealogics person ID | [30] | secondary source resulting from personal research | Semi-structured |
The Hermitage - Authors | [31] | no biographies | Oppose |
LoC artists | [32] | no biographies | Oppose |
MOMA | [33] | no biographies | Oppose |
MSBI | [34] | short utterances (may be hard to parse) | Semi-structured |
MUNKSROLL | [35] | | Support |
Metallum bands | [36] | | Support |
National Gallery of Art | [37] | no biographies | Oppose |
National Gallery of Victoria | [38] | no biographies | Oppose |
National Library of Ireland | [39] | no biographies | Oppose |
Notable Names Database | [40] | | Support |
Belgian people and things | [41] | no biographies | Oppose |
Open Library | [42] | no biographies | Oppose |
ORCID | [43] | biographies may be missing | maybe |
OpenPlaques | [44] | no biographies | Oppose |
Project Gutenberg | [45] | no biographies | Oppose |
National Library of Australia | [46] | actually links to other catalogues | Oppose |
Smithsonian American Art Museum | [47] | no biographies | Oppose |
Structurae persons | [48] | | Semi-structured |
Theatricalia | [49] | no biographies | Oppose |
Web Gallery of Art | [50] | lots of embedded frames in pages; commercial web site, as suggested by Spinster | Oppose |
Parliament UK | [51] | | Semi-structured |
Catholic Encyclopedia (1913) | [52] | Wikisource; not only biographies (encyclopedia) | Support |
Baker's Biographical Dictionary of Musicians | [53] | full raw text available | Support |
RKDartists | [54] | suggested by Spinster | Semi-structured |
The table below instead shows a list of Wikisource sources, as per wikisource:Category:Biographical_dictionaries and wikisource:Wikisource:WikiProject_Biographical_dictionaries. @Nemo bis: many thanks for suggesting the links.
Source | URL | Comments | Candidate? |
---|---|---|---|
Dictionary of National Biography | [55] | | Support |
History of Alabama and Dictionary of Alabama Biography | [56] | almost no data (except for Montgomery) | Oppose |
American Medical Biographies | [57] | | Support |
A Biographical Dictionary of Ancient, Medieval, and Modern Freethinkers | [58] | everything in one page, may be tricky to parse | Support |
The Dictionary of Australasian Biography | [59] | | Support |
Dictionary of Christian Biography and Literature to the End of the Sixth Century | [60] | | Support |
A Dictionary of Artists of the English School | [61] | quite incomplete (only A, F, K); one page per letter, may be tricky to parse | Support |
A Short Biographical Dictionary of English Literature | [62] | | Support |
Dictionary of Greek and Roman Biography and Mythology | [63] | | Support |
The Indian Biographical Dictionary (1915) | [64] | | Support |
Modern English Biography | [65] | very little data | Support |
Who's Who, 1909 | [66] | only 2 persons | maybe |
Who Was Who (1897 to 1916) | [67] | almost nothing in Wikisource, but full text available at archive.org | Support |
A Dictionary of Music and Musicians | [68] | not only biographies | Support |
Men-at-the-Bar | [69] | lots of abbreviations | Support |
A Naval Biographical Dictionary | [70] | | Support |
Makers of British botany | [71] | few people, very long biographies | maybe |
Biographies of Scientific Men | [72] | few people, very long biographies | maybe |
A Chinese Biographical Dictionary | [73] | | Support |
Who's Who in China (3rd edition) | [74] | | Support |
A Compendium of Irish Biography | [75] | no data | Oppose |
Chronicle of the law officers of Ireland | [76] | one page per chapter, may be tricky to parse | Support |
A biographical dictionary of eminent Scotsmen | [77] | no data | Oppose |
Dictionary of National Biography, 1901 supplement | [78] | need to check the intersection with the original edition | Support |
Dictionary of National Biography, 1912 supplement | [79] | need to check the intersection with the original edition | Support |
Woman's Who's Who of America, 1914-15 | [80] | very little data | Support |
The Biographical Dictionary of America | [81], [82], [83], [84], [85], [86], [87], [88], [89], [90] | almost nothing in Wikisource, but full text available at archive.org | Support |
Historical and biographical sketches | [91] | few people, very long biographies | maybe |
Cartoon portraits and biographical sketches of men of the day | [92] | | Support |
Men of the Time, eleventh edition | [93] | | Support |
Relevant Wikidata Properties Statistics
The table below captures 3 different usage signals of Wikidata properties relevant to the biographical domain, namely:
- frequency ranking in Items, as per d:Wikidata:Database_reports/List_of_properties/Top100;
- percentage of unsourced statements, as per http://tools.wmflabs.org/wd-analyst/;
- external use, as per d:Special:WhatLinksHere/Template:ExternalUse.
This list of properties may serve as a valid starting point for the Candidate Relations Set milestone.
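As an illustration of how the unsourced-statements signal can be gathered programmatically, here is a minimal sketch against the public Wikidata SPARQL endpoint. The query shape is our own approximation of the signal, not the exact method used by the wd-analyst tool; the restriction to painters is only there to keep the query within the endpoint timeout.

```python
import requests

WDQS = 'https://query.wikidata.org/sparql'

# Approximate the unsourced share of date of birth (P569) statements,
# restricted to painters (Q1028181) so the query stays within the WDQS
# timeout. Statements with multiple references are counted once per
# reference, so this slightly overestimates the referenced share.
QUERY = '''
SELECT (COUNT(?statement) AS ?total) (COUNT(?ref) AS ?referenced) WHERE {
  ?item wdt:P106 wd:Q1028181 ;
        p:P569 ?statement .
  OPTIONAL { ?statement prov:wasDerivedFrom ?ref . }
}
'''

response = requests.get(WDQS, params={'query': QUERY, 'format': 'json'})
response.raise_for_status()
row = response.json()['results']['bindings'][0]
total, referenced = int(row['total']['value']), int(row['referenced']['value'])
print('unsourced: %.1f%%' % (100.0 * (total - referenced) / total))
```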
Label | ID | Frequency Ranking | Unsourced Statements % | External Use | Domain | Comments | Lexical Unit | Frame | Range FE | Candidate? |
---|---|---|---|---|---|---|---|---|---|---|
country | P17 | 4th | 48% | yes | places | unsuitable domain | | | | Oppose |
sex or gender | P21 | 6th | 70% | yes | persons, animals | use semi-structured scraped data | | | | Support |
date of birth | P569 | 9th | 37% | yes | person, organism | use semi-structured scraped data | bear | | | Support |
given name | P735 | 10th | 92% | no | person | use semi-structured scraped data | | | | Support |
occupation | P106 | 14th | 79% | yes | person | | be (very risky) | Being_employed | Position | Support |
country of citizenship | P27 | 16th | 71% | yes | person, term | | come from | People_by_origin | Origin | Support |
date of death | P570 | 21st | 37% | yes | person, organism | use semi-structured scraped data | die | | | Support |
place of birth | P19 | 23rd | 18% | yes | person | use semi-structured scraped data | bear | | | Support |
official name | P1448 | 26th | almost 0% | yes | all items | tricky one, may use semi-structured scraped data | | Being_named | Name | Support |
place of death | P20 | 32nd | 35% | yes | person | use semi-structured scraped data | die | | | Support |
educated at | P69 | 44th | 77% | yes | person | | study | Education_teaching | Institution | Support |
languages spoken or written | P1412 | 55th | 32% | no | person | no FrameNet data, should use a custom frame | speak | | | Support |
position held | P39 | 60th | 82% | yes | person | the mapping seems reasonable, but conflicts with P106 | work | Being_employed | Position | Support |
award received | P166 | 62nd | 81% | yes | person, organization, creative work | | win | Win_prize | Prize | Support |
member of political party | P102 | 67th | 60% | yes | people that are politicians | | | | | Support |
family name | P734 | 68th | 70% | yes | persons | use semi-structured scraped data | | | | Support |
creator | P170 | 73rd | 39% | yes | N.A. | inverse property (person is the range) | create | Intentionally_create | Created_entity (domain) | Support |
author | P50 | 75th | 34% | yes | work | inverse property (person is the range); sub-property of creator for written works; constrain to domain type = Written work | same as creator | same as creator | same as creator | Support |
director | P57 | 77th | 18% | yes | work | inverse property (person is the range); seems a sub-property of creator for motion pictures, plays, and video games; constrain to domain types = (Movie, Play, VideoGame) | direct | Behind_the_scenes | Production (domain) | Support |
member of | P463 | 87th | 92% | yes | person | | belong | Membership | Group | Support |
participant of | P1344 | 89th | 92% | no | human, group of humans, organization | | engage | Participation | Event | Support |
employer | P108 | 97th | 92% | yes | human | | commission | Employing | Employer | Support |
February 2016
Week 1
- The development corpus is already in good shape, with ca. 700,000 items scraped from 50 sources. Ca. 180,000 items contain raw text biographies to feed the NLP pipeline (cf. https://github.com/Wikidata/StrepHit/issues/13#issuecomment-185314081);
- the corpus analysis module baseline is implemented, and currently yields ca. 20,000 verb lemmas (a sketch of this kind of processing follows this list);
- we are checking whether the relevant Wikidata properties shown above can be triggered by our verb lemmas: if so, they will definitely serve as the first 21 candidate relations. The remaining 29 will be extracted according to the lexicographical and statistical rankings;
- updating the Wikidata properties table above.
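To make the verb lemma extraction concrete, here is a minimal sketch using NLTK. The actual StrepHit corpus analysis module has its own implementation; the function below is purely illustrative.

```python
from collections import Counter

import nltk
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
# nltk.download('wordnet')

def extract_verb_lemmas(text, lemmatizer=WordNetLemmatizer()):
    """Count the verb lemmas found in a raw text biography."""
    lemmas = Counter()
    for sentence in nltk.sent_tokenize(text):
        tokens = nltk.word_tokenize(sentence)
        for token, tag in nltk.pos_tag(tokens):
            if tag.startswith('VB'):  # Penn Treebank verb tags
                lemmas[lemmatizer.lemmatize(token.lower(), pos='v')] += 1
    return lemmas

# 'born' lemmatizes to 'bear', matching the lexical unit in the table above.
print(extract_verb_lemmas('Johann Sebastian Bach was born in Eisenach and composed keyboard works.'))
```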
Week 2
After inspecting the set of verb lemmas, we found a lot of noise, mainly caused by the default tokenization logic of the POS-tagging library we used. Therefore, we implemented our own tokenizer, to be leveraged by all modules. A second run of the corpus analysis yielded ca. 7,600 verb lemmas: fewer items of much higher quality.
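A regex-based tokenizer along these lines is a hypothetical sketch of the approach, not the exact StrepHit implementation: the point is that a crude but predictable rule beats an opaque library default.

```python
import re

# Words (with internal apostrophes or hyphens) or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:['\-]\w+)*|[^\w\s]")

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)

print(tokenize("Smith wasn't born in mid-19th century London."))
# ['Smith', "wasn't", 'born', 'in', 'mid-19th', 'century', 'London', '.']
```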
Verb Rankings
The final output of the corpus analysis module is a pair of rankings, one based on lexicographical evidence (i.e., TF-IDF) and one on statistical evidence (i.e., standard deviation); cf. https://github.com/Wikidata/StrepHit/issues/5#issuecomment-188293650 for more technical details.
For each ranking, we intersected the top 50 lemmas with FrameNet data; the results can be found at https://github.com/Wikidata/StrepHit/issues/5#issuecomment-189376446
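To illustrate the lexicographical ranking, here is a sketch of a TF-IDF ranking of terms over a toy corpus with scikit-learn. The real ranking logic lives in the corpus analysis module; the corpus and the max-score aggregation below are only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: one document per biography.
corpus = [
    'born in Paris , studied at the Sorbonne , died in Lyon',
    'born in London , painted portraits , died in Bath',
    'studied law , worked as a judge , died in Dublin',
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# Rank terms by their maximum TF-IDF score across documents.
scores = matrix.max(axis=0).toarray().ravel()
ranking = sorted(zip(vectorizer.get_feature_names_out(), scores),
                 key=lambda pair: pair[1], reverse=True)
for term, score in ranking[:5]:
    print('%s\t%.3f' % (term, score))
```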
Week 3
- We observed that the sources we scrape also contain semi-structured data, which can be a very valuable source of statements. Hence, we have been working on a Wikidata dataset: we plan to upload it to the primary sources tool backend instance, hosted at the Wikimedia Tool Labs, and announce it to the community;
- the entity linking facility is implemented, and currently supports the Dandelion Entity Extraction API (a sketch of such a call follows this list);
- the crowdsourcing annotation module is underway, and currently interacts with the CrowdFlower API for posting annotation jobs and pulling results.
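A minimal sketch of a call to the Dandelion Entity Extraction API: the token is a placeholder you obtain from the service, and the confidence threshold is an assumption of ours, not a StrepHit setting.

```python
import requests

DANDELION_NEX = 'https://api.dandelion.eu/datatxt/nex/v1/'

def link_entities(text, token='YOUR_API_TOKEN'):  # placeholder token
    """Return (spot, Wikipedia URI, confidence) triples for entities in text."""
    response = requests.get(DANDELION_NEX, params={
        'text': text,
        'token': token,
        'min_confidence': 0.6,  # illustrative threshold
    })
    response.raise_for_status()
    return [(annotation['spot'], annotation['uri'], annotation['confidence'])
            for annotation in response.json().get('annotations', [])]

print(link_entities('William Turner was born in Covent Garden, London.'))
```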
Week 4
- Started working on the extraction of sentences from the corpus: they will get sampled and serve as seeds for the training and test sets, as well as for the actual classification;
- brainstorming session to come up with three extraction strategies (a toy sketch of the first two follows this list):
  - n2n (default), i.e., many sentences per many LUs. This entails that the same sentence is likely to be extracted multiple times;
  - 121, i.e., one sentence per LU. This entails that a single sentence will be extracted only once;
  - syntactic (to be implemented), i.e., extraction based on dependency parsing. We argue that this may be useful to split long complex sentences;
- we have contacted the primary sources tool maintainers and requested either access to the specific machine at Tool Labs, or a token for the /import service (undocumented in the tool homepage, but documented in Google's codebase);
- currently, we are waiting for their answer, and will upload the semi-structured dataset as soon as we are granted access.
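A toy sketch contrasting the n2n and 121 strategies as described above; the real implementation works on the full corpus and on proper LU matching rather than plain substring checks.

```python
def n2n(sentences, lexical_units):
    """Many sentences per many LUs: a sentence is yielded once per matching LU."""
    for lu in lexical_units:
        for sentence in sentences:
            if lu in sentence.lower():
                yield lu, sentence

def one21(sentences, lexical_units):
    """One sentence per LU ('121'): each sentence is yielded at most once."""
    seen = set()
    for lu in lexical_units:
        for sentence in sentences:
            if sentence not in seen and lu in sentence.lower():
                seen.add(sentence)
                yield lu, sentence
                break  # move to the next LU after the first hit

sentences = ['He was born in Rome and died in Florence.', 'She died in Milan.']
lus = ['born', 'die']
print(list(n2n(sentences, lus)))    # the first sentence appears twice
print(list(one21(sentences, lus)))  # each sentence at most once
```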
March 2016
Week 1
- Testing the sentence extraction module;
- experimenting with extraction strategies:
  - basic ones (n2n, 121) are noisy;
  - syntactic is computationally intensive;
- computing corpus statistics: sources, biography distribution;
- working on the semi-structured dataset:
  - resolving honorifics more reliably;
- caching facilities: general-purpose and entity linking caching.
Week 2
- Got access to the Tool Labs machine serving the primary sources tool;
- interacting with the tool maintainers to understand the architecture;
- working on the semi-structured dataset:
- improved named entity resolution;
- data serialization;
- date handling.
- refactoring the sentence extraction module: parallel processing;
- refactoring the POS tagger interface;
- English-specific issue during sentence extraction: verb base forms may be ambiguous with nouns, so we ensure that only verbs are extracted by checking POS tags;
- before uploading the semi-structured dataset, we need to guarantee that all the specific measures of success (cf. Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#Measures_of_Success) are technically implemented;
- first pull request to the primary sources tool codebase: https://github.com/google/primarysources/pull/86
- dissemination: working on the revision of a scientific article submitted to the Semantic Web Journal: http://semantic-web-journal.org/content/n-ary-relation-extraction-joint-t-box-and-box-knowledge-base-augmentation
Week 3
- Second pull request to the primary sources tool codebase: https://github.com/google/primarysources/pull/87
- first crowdsourcing job pilots:
  - input data created with the 3 extraction strategies;
  - failed: workers got confused by (a) overly long sentences, and (b) obscure labels returned by FrameNet;
- working on the integration of the crowdsourcing platform CrowdFlower:
  - dynamic generation of the worker interface based on input data;
  - automating the flow via the API;
  - interacting with the CrowdFlower help desk to fix issues in the API;
- working on the scientific article revision;
- investigating the relevance of the FrameNet frame repository:
  - FEs will be annotated only if they can be mapped to Wikidata properties, otherwise they are useless;
- implementing a simple FE-to-Wikidata-properties matcher (a sketch follows this list):
  - functions to retrieve Wikidata property IDs, full entity metadata, and labels and aliases only;
  - extract FEs only if they map to Wikidata properties via exact matching of labels and aliases.
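A sketch of the exact-matching idea, using the real wbgetentities module of the Wikidata API; the property set and the FE label below are illustrative, and the actual matcher in StrepHit is richer than this.

```python
import requests

WIKIDATA_API = 'https://www.wikidata.org/w/api.php'

def property_lexicon(property_ids):
    """Map lowercased English labels and aliases to Wikidata property IDs."""
    response = requests.get(WIKIDATA_API, params={
        'action': 'wbgetentities',
        'ids': '|'.join(property_ids),
        'props': 'labels|aliases',
        'languages': 'en',
        'format': 'json',
    })
    response.raise_for_status()
    lexicon = {}
    for pid, entity in response.json()['entities'].items():
        label = entity.get('labels', {}).get('en')
        names = [label['value']] if label else []
        names += [alias['value'] for alias in entity.get('aliases', {}).get('en', [])]
        for name in names:
            lexicon[name.lower()] = pid
    return lexicon

lexicon = property_lexicon(['P69', 'P106', 'P166'])

def match_frame_element(fe_label):
    """Exact match of a FE label against property labels and aliases."""
    return lexicon.get(fe_label.lower())

print(match_frame_element('educated at'))  # P69
```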
Week 4
- Working on the simplification of the annotation jobs:
  - filter numerical FEs, which should instead be classified directly with a rule-based strategy;
  - let extra FEs be annotated too, not only core ones;
  - FEs with no mapping to Wikidata should not be skipped; their labels should rather be made understandable;
- testing and documenting parallel processing facilities;
- new sentence extraction strategy: grammar-based.
April 2016
Week 1
- Midpoint report written;
- working on the technical documentation:
  - generating Sphinx-based Python docs for MediaWiki;
  - pull request to Sphinx for handling wiki syntax: https://github.com/sphinx-doc/sphinx/pull/2444
- corpus stats: length distribution of biographies;
- working on the scientific article revision.
Week 2
- Nearly full-time work on the scientific article revision;
- dissemination: participated in the HackAtoka hackathon at SpazioDati, Trento:
  - implemented a rule-based classifier for the companies domain.
Week 3
- Full-time work on the scientific article revision;
- revision submitted to the Semantic Web Journal: http://semantic-web-journal.org/content/n-ary-relation-extraction-simultaneous-t-box-and-box-knowledge-base-augmentation
Week 4
- Full-time work on the primary sources tool:
  - a major refactoring by the main primary sources tool backend maintainer seems to have broken our previous pull request: this is preventing us from gathering quantitative measures of success;
  - fixed via https://github.com/google/primarysources/pull/100;
  - enabled dataset-specific top users statistics via https://github.com/google/primarysources/pull/102;
- held a hackathon at the Spaghetti Open Data Reunion to test the primary sources tool with the semi-structured dataset: http://www.spaghettiopendata.org/content/wikidata-la-banca-di-conoscenza-libera-casa-wikimedia
  - Italian Wikimedians attended the hackathon and helped us a lot. @CristianCantoro, Laurentius, Jaqen, and CristianNX: thank you all for your precious work!
- gathered particularly valuable feedback on the tool and opened issues accordingly.
May 2016
Week 1
- Side project: Wikimedia Italy asked us to take the first steps towards integrating a Wiki Loves Monuments Italy dataset into Wikidata;
- focus on crowdsourcing the annotation job to build the training set:
  - tested the input data;
  - improved the worker interface, e.g., more detailed instructions, displaying the LU instead of the frame;
- noticed that certain sources (e.g., ULAN) already have dedicated Wikidata properties for their IDs: added mappings;
- further reliable sources in English;
- first version of a rule-based classification approach.
Week 2
- Bug fixing based on the SOD hackathon feedback (cf. #Week_4_3):
  - DSI and rkd.nl sources scraping;
  - fixed a wrong Wikidata property mapping with possibly high impact;
- work on the normalization of numerical expressions, typically dates (a sketch follows this list):
  - regular expressions to capture them;
  - transformation rules to fit the Wikidata data types;
  - tests;
- first version of the supervised classifier.
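To illustrate the normalization step, a sketch that captures a common date pattern with a regular expression and fits it into the Wikibase time data type (precision 11 for day, 9 for year is part of the Wikibase data model; the regexes here are deliberately simplified):

```python
import re

MONTHS = {m: i for i, m in enumerate(
    ['january', 'february', 'march', 'april', 'may', 'june', 'july',
     'august', 'september', 'october', 'november', 'december'], 1)}

FULL_DATE = re.compile(r'(\d{1,2}) (\w+) (\d{4})')  # e.g., "27 January 1756"
YEAR_ONLY = re.compile(r'\b(\d{4})\b')

def to_wikidata_time(expression):
    """Normalize a date expression into the Wikibase time value format."""
    match = FULL_DATE.search(expression)
    if match and match.group(2).lower() in MONTHS:
        day, month, year = (int(match.group(1)),
                            MONTHS[match.group(2).lower()],
                            int(match.group(3)))
        return {'time': '+%04d-%02d-%02dT00:00:00Z' % (year, month, day),
                'precision': 11}  # day precision
    match = YEAR_ONLY.search(expression)
    if match:
        return {'time': '+%04d-00-00T00:00:00Z' % int(match.group(1)),
                'precision': 9}  # year precision
    return None

print(to_wikidata_time('born on 27 January 1756'))
print(to_wikidata_time('flourished circa 1680'))
```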
Week 3
- Attended and actively participated in WikiCite_2016:
  - submitted a proposal: WikiCite_2016/Proposals/Generation_of_referenced_Wikidata_statements_with_StrepHit;
  - established and led work group 4: WikiCite_2016/Report/Group_4;
  - engaged potential data donors for the primary sources tool;
  - opened a Request for Comments to centralize feedback on the primary sources tool and the StrepHit dataset: wikidata:Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements;
- resolved family trees in genealogics.org;
- first version of the dataset serializer;
- improved documentation:
  - Sphinx extension to handle the MediaWiki syntax;
- parallel entity linking;
- integration of WikiCite_2016 feedback.
Week 4
- Major unplanned outcome: entities that could not be resolved to Wikidata IDs during dataset serialization may serve as new Wikidata Items.
  - Action: the final list of unresolved entities will be proposed to the community;
- entities that are places should not undergo the annotation, but rather be directly classified;
- lexical database improvements:
  - FE-to-Wikidata property mappings;
  - marked FEs that should become the subjects of the output statements;
- plugged in a gazetteer as an extra set of features for the supervised classifier;
- resolving countries of citizenship from nationalities (a sketch follows this list);
- preparing the StrepHit pipeline 1.0 beta release.
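The nationality resolution can be as simple as a lookup table from demonyms to country QIDs, as in this sketch; the table below is a tiny illustrative excerpt, not the actual StrepHit resource.

```python
# Tiny illustrative excerpt of a demonym-to-QID lookup table.
NATIONALITY_TO_COUNTRY = {
    'english': 'Q21',    # England
    'french': 'Q142',    # France
    'german': 'Q183',    # Germany
    'italian': 'Q38',    # Italy
    'scottish': 'Q22',   # Scotland
}

def resolve_citizenship(nationality):
    """Map a nationality adjective to a country of citizenship (P27) QID."""
    return NATIONALITY_TO_COUNTRY.get(nationality.strip().lower())

print(resolve_citizenship('French'))  # Q142
```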
June 2016
Week 1
- StrepHit pipeline version 1.0 Beta released and announced: https://github.com/Wikidata/StrepHit/releases/tag/1.0-beta
- the ULAN scraper now harvests URLs to human-readable resources;
- support for multiple subjects in the dataset serialization;
- improvements to the Sphinx Wikitext documentation extension;
- automatic model selection to pick the best model for the supervised classifiers.
Week 2
- High priority on the supervised classifiers (a sketch of the one-classifier-per-LU idea follows this list):
  - experiments on word embedding approaches;
  - included lexical units as features;
  - use one classifier per lexical unit;
- attended Wikimania 2016: https://wikimania2016.wikimedia.org
  - presented a poster: https://wikimania2016.wikimedia.org/wiki/Posters#StrepHit
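A sketch of the one-classifier-per-LU idea with scikit-learn; the training data and bag-of-words features are toy stand-ins, while the real pipeline uses richer linguistic features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data, grouped by lexical unit: (sentence, frame label).
TRAINING = {
    'born': [('He was born in Paris in 1802.', 'Being_born'),
             ('She was born to a family of musicians.', 'Being_born'),
             ('The idea was born out of necessity.', 'Coming_up_with')],
    'win':  [('He won the Nobel Prize in 1921.', 'Win_prize'),
             ('She won the election by a landslide.', 'Finish_competition'),
             ('They won the championship twice.', 'Win_prize')],
}

# One classifier per lexical unit.
classifiers = {}
for lu, examples in TRAINING.items():
    sentences, labels = zip(*examples)
    classifiers[lu] = make_pipeline(CountVectorizer(),
                                    LinearSVC()).fit(sentences, labels)

print(classifiers['born'].predict(['The project was born from a hackathon.']))
```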
Week 3
- Parametrizable script for supervised training;
- stopwords should not be features;
- normalization of names for better entity resolution;
- optional feature reduction facility;
- major change in entity resolution: resolve Wikidata QIDs by looking up linked entity URIs (a sketch follows this list);
- skipping non-linked chunks in feature extraction;
- optional k-fold validation in the training script;
- handling qualifiers at dataset serialization.
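The QID lookup can go through the wbgetentities module of the Wikidata API, resolving the Wikipedia title embedded in a linked entity's URI to its Wikidata Item, as in this sketch (a simplification of the idea, not the exact StrepHit code):

```python
from urllib.parse import unquote, urlsplit

import requests

WIKIDATA_API = 'https://www.wikidata.org/w/api.php'

def qid_from_wikipedia_uri(uri):
    """Resolve an English Wikipedia URI (as returned by the entity linker)
    to a Wikidata QID via the enwiki sitelink."""
    title = unquote(urlsplit(uri).path.rsplit('/', 1)[-1]).replace('_', ' ')
    response = requests.get(WIKIDATA_API, params={
        'action': 'wbgetentities',
        'sites': 'enwiki',
        'titles': title,
        'props': 'info',  # we only need the entity IDs
        'format': 'json',
    })
    response.raise_for_status()
    for entity_id in response.json().get('entities', {}):
        if entity_id.startswith('Q'):  # missing pages come back as '-1'
            return entity_id
    return None

print(qid_from_wikipedia_uri('https://en.wikipedia.org/wiki/Johann_Sebastian_Bach'))  # Q1339
```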
Week 4
- Do not serialize frame elements that have a wrong class;
- performance evaluation of the supervised classifiers (a sketch follows this list):
  - 10-fold cross validation;
  - comparison with dummy classifiers;
  - accuracy values against a gold standard of 249 fully annotated sentences;
- correctly handling places;
- StrepHit pipeline version 1.1 Beta released and announced: https://github.com/Wikidata/StrepHit/releases/tag/1.1-beta
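The comparison against dummy classifiers can be phrased as in this sketch with scikit-learn; the synthetic data below is a toy stand-in for the 249-sentence gold standard, and the linear SVC is only one plausible model choice.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy stand-in for the gold standard of 249 annotated sentences.
features, labels = make_classification(n_samples=249, n_features=20,
                                       random_state=0)

for name, model in [('dummy (most frequent)',
                     DummyClassifier(strategy='most_frequent')),
                    ('linear SVC', LinearSVC())]:
    # 10-fold cross validation, as in the evaluation above.
    scores = cross_val_score(model, features, labels, cv=10)
    print('%s: mean accuracy %.3f' % (name, scores.mean()))
```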