This project is funded by an Individual Engagement Grant

This Individual Engagement Grant is renewed

Welcome to this project's final report! This report shares the outcomes, impact and learnings from the Individual Engagement Grantee's 6-month project.

Part 1: The Project

Summary

In a few short sentences, give the main highlights of what happened with your project. Please include a few key outcomes or learnings from your project in bullet points, for readers who may not make it all the way through your report.

Planned achievements, as per the project goals and timeline:

Web Sources Production Corpus: 1.8 M items, 515 k documents (biographies), 53 reliable sources;
Candidate Relations Set: 49 frames, 229 total frame elements, 133 unique frame elements, 69 unique Wikidata relations;
StrepHit Pipeline Beta: v. 1.0 beta, v. 1.1 beta;
Web Sources Knowledge Base: 842 k confident claims + 958 k supervised + 808 k rule-based = 2.6 M total claims;
Primary Sources Tool: 5 merged pull requests, active request for comment.

Bonus achievements, beyond the goals:

Web Sources Corpus: +265 k (+106%) documents, +3 sources;
Candidate Relations Set: +19 (+38%) Wikidata relations;
Web Sources Knowledge Base: +359 k (+16%) Wikidata claims;
Candidate Items dataset: a set of entities found in the corpus that could be added to Wikidata (needs validation);
Wiki Loves Monuments Italy: a prototype dataset for Wikidata;
Italian companies dataset: a proof-of-scalability dataset (another language, another domain), as a result of the HackAtoka hackathon.

Codebase: 9,425 lines of Python code, 2 releases, 474 commits, 16 open issues, 37 closed issues.

Access all the resources here: #Project_resources
Play with the datasets. Read the instructions and provide feedback here: wikidata:Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements

Methods and activities

What did you do in your project?

Please list and describe the activities you've undertaken during this grant. Since you already told us about the setup and first 3 months of activities in your midpoint report, feel free to link back to those sections to give your readers the background, rather than repeating yourself here, and mostly focus on what's happened since your midpoint report in this section.

Main room for the hackathon held at the public library of Trento, Italy, during the Spaghetti Open Data Reunion 2016

The project has been managed as per Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint#Methods_and_activities.

Dissemination

As detailed in the planned outreach activities and in the April and May monthly reports, we conducted the following dissemination efforts after the midpoint.

HackAtoka hackathon at SpazioDati: http://blog.atoka.io/hackatoka-open-innovation-al-lavoro-per-testare-le-nuove-atoka-api/ (in Italian)
- the StrepHit team in action, picture 1: http://blog.atoka.io/wp-content/uploads/2016/05/hackAtoka-brainstorming.jpg
- picture 2: http://blog.atoka.io/wp-content/uploads/2016/05/hackAtoka-MachineReadingNewsAPI-1024x683.jpg
major revision of the research article submitted to the Semantic Web Journal: http://semantic-web-journal.org/content/n-ary-relation-extraction-simultaneous-t-box-and-box-knowledge-base-augmentation
hackathon at Spaghetti Open Data Reunion: http://www.spaghettiopendata.org/content/wikidata-la-banca-di-conoscenza-libera-casa-wikimedia
- see the attendees: https://twitter.com/SignoraRamsay/status/728873548643770368/
WikiCite 2016: WikiCite_2016, WikiCite_2016/Proposals/Generation_of_referenced_Wikidata_statements_with_StrepHit, WikiCite_2016/Report/Group_4
Poster at Wikimania 2016: https://wikimania2016.wikimedia.org/wiki/Posters#StrepHit

Side Projects

Besides StrepHit, we have been contributing to the following projects:

Primary Sources Tool, with 5 merged pull requests [1], [2], [3], [4], [5]
Prototype import of Wiki Loves Monuments Italy into Wikidata: http://it.dbpedia.org/downloads/strephit/wlm_italy_prototype/, wikidata:Wikidata:Project_chat/Archive/2016/06#Importing_Wiki_Loves_Monuments_lists_into_Wikidata
Sphinx Python documentation builder: https://github.com/sphinx-doc/sphinx/pull/2444, https://github.com/Wikidata/StrepHit/tree/master/strephit/sphinx_wikisyntax

WikiCite 2016 attendees

Outcomes and impact

Outcomes as per stated goals

What are the results of your project?

Please discuss the outcomes of your experiments or pilot, telling us what you created or changed (organized, built, grew, etc) as a result of your project.

The key planned outcomes of StrepHit are:

the Web Sources Corpus, composed of about 1.8 M items gathered from 53 reliable Web sources;
the Natural Language Processing pipeline to extract Wikidata claims from free text;
the Web Sources Knowledge Base, composed of about 2.6 M Wikidata claims.

Please use the table below to:

List each of your original measures of success (your targets) from your project plan.
List the actual outcome that was achieved.
Explain how your outcome compares with the original target. Did you reach your targets? Why or why not?

Planned measure of success (include numeric target, if applicable)	Actual result	Explanation
Web Sources Production Corpus: 250 k documents, 50 sources	1,778,149 items, 515,212 documents, 53 sources	+265,212 (+106%) documents, +3 sources
Candidate Relations Set: 50 Wikidata relations	49 frames, 229 total frame elements, 133 unique frame elements, 69 unique Wikidata relations	+19 (+38%) Wikidata relations
StrepHit Pipeline Beta	releases: v. 1.0 beta, v. 1.1 beta	the first version is a working NLP pipeline. The second one contains improvements of the supervised classification system.
Web Sources Knowledge Base: 2.25 M Wikidata claims	842,208 confident + 958,491 supervised + 808,708 rule-based = 2,609,407 total claims	+359,407 (+16%) Wikidata claims. Note that we picked a subset of the rule-based output, with confidence scores > 0.8. The whole output actually contains 2,822,538 claims, and we would have obtained a much larger knowledge base: 4,623,237 total claims, thus +2,373,237 (+105%). However, we decided to discard potentially low-quality ones.
Primary Sources Tool	5 merged pull requests, active community discussion	On one hand, we have implemented the planned features. On the other, we have centralized the discussion on the tool usability and the available datasets.
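
The figures in the Explanation column can be re-derived from the targets and the actual results. The following is a minimal Python sketch of that arithmetic, not part of the StrepHit codebase: the function name `surplus` is introduced here for illustration only, and all numbers are copied from the table above.

```python
def surplus(actual, target):
    """Return the absolute and the percentage surplus of actual over target."""
    diff = actual - target
    return diff, 100.0 * diff / target

# Figures copied from the table above.
confident, supervised, rule_based = 842_208, 958_491, 808_708
rule_based_full = 2_822_538  # whole rule-based output, without the > 0.8 confidence cut
target_claims = 2_250_000

total = confident + supervised + rule_based
total_full = confident + supervised + rule_based_full

for label, actual, target in [
    ("documents", 515_212, 250_000),
    ("Wikidata relations", 69, 50),
    ("claims (filtered)", total, target_claims),
    ("claims (unfiltered)", total_full, target_claims),
]:
    diff, pct = surplus(actual, target)
    print(f"{label}: {actual:,} vs {target:,} -> +{diff:,} (+{pct:.1f}%)")
```

The percentages in the table are these values reduced to whole numbers.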

Think back to your overall project goals. Do you feel you achieved your goals? Why or why not?

Yes. Not only did we achieve all the in-scope goals as per Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#In_Scope, but we also exceeded all the quantitative expectations.

Furthermore, we produced a set of bonus achievements.

Bonus Outcomes

Besides the planned goals, we reached the following bonus outcomes, in order of relevance to the Wikidata community:

the unresolved entities dataset. When generating the Web Sources Knowledge Base, a (rather large) set of entities could not be resolved to Wikidata QIDs. They may serve as candidates for new Wikidata Items;
the Wiki Loves Monuments for Wikidata prototype dataset. We were contacted by Wikimedia Italy to implement a very first integration of a WLM Italy dataset into Wikidata;
a rule-based statement extraction technique, which does not require any training set, although it may yield less accurate extractions. It can be thought of as a trade-off between the text annotation and the statement validation costs;
the Italian companies dataset, as a result of the HackAtoka hackathon. It is a proof of scalability for the StrepHit pipeline: the rule-based technique has been successfully applied to another domain (companies), in another language (Italian).

Classification Output

Amount of (1) sentences extracted from the input corpus, (2) classified sentences, and (3) generated Wikidata claims, with respect to confidence scores of linked entities

Performance values of the supervised classifier among a random sample of lexical units: (1) F1 scores via 10-fold cross validation, compared to a dummy classifier; (2) accuracy scores against a gold standard of 249 annotated sentences

Claim Correctness Evaluation

We carried out an empirical evaluation of the final output by randomly sampling 48 claims from each of the supervised and the rule-based datasets. Since StrepHit is a pipeline with several components, we computed the accuracy of those responsible for the actual generation of claims. Results indicate the ratio of correct data for each of them, as well as the overall claim correctness. The reader may refer to the BSc thesis [6] and the article [7] for full details of the system architecture.

Dataset	Claims	Linker	Classifier	Normalizer	Resolver	Overall
supervised	48	0.8125	0.781	1	0.285	0.638
rule-based	48	0.709	0.607	1	0.5	0.588

Sample Claims

Machine-readable ones are expressed in the QuickStatements syntax [8].

Correct Examples

Machine	Human
`Q18526540 P569 +00000001815-02-24T00:00:00Z/11 S854 "http://adb.anu.edu.au/biography/barkly-sir-henry-2936"`	According to the Australian Dictionary of Biography, Arthur Barkly was born on February 24, 1815
`Q16058737 P106 Q80687 S854 "https://ia902707.us.archive.org/1/items/biographicaldict08johnuoft/biographicaldict08johnuoft_djvu.txt"`	According to The Biographical Dictionary of America, Charles Millard Pratt has been a secretary
`Q515632 P69 Q1068752 S854 "http://www.nndb.com/people/215/000042089/"`	According to the Notable Names Database, Ossie Davis was educated at Howard University
`Q18922309 P937 Q777039 S854 "http://munksroll.rcplondon.ac.uk/Biography/Details/140"`	According to the Royal College of Physicians, Henry Ashby has worked at Guy's Hospital
`Q4861627 P19 Q739700 S854 "http://www.bbc.co.uk/arts/yourpaintings/artists/barnett-freedman"`	According to the BBC Your Paintings (now Art UK), Barnett Freedman was born in the East End of London
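
To make the Machine column easier to read, here is a minimal Python sketch that splits one such line into subject, property, value, and reference. It assumes the space-separated layout shown in the table (one item, one property, one value, then pairs of source property and quoted source value). `parse_claim` and the `Claim` tuple are hypothetical names for illustration; they are not part of StrepHit or QuickStatements.

```python
import shlex
from collections import namedtuple

# Hypothetical container for illustration only.
Claim = namedtuple("Claim", "subject prop value references")

def parse_claim(line):
    """Split a space-separated QuickStatements-like line into its parts."""
    tokens = shlex.split(line)  # honours the double quotes around URLs
    if len(tokens) < 3 or (len(tokens) - 3) % 2 != 0:
        raise ValueError("expected: item property value [source value]...")
    subject, prop, value = tokens[:3]
    rest = tokens[3:]
    # Remaining tokens come in (source property, source value) pairs,
    # e.g. S854 followed by the reference URL.
    references = list(zip(rest[0::2], rest[1::2]))
    return Claim(subject, prop, value, references)

line = 'Q515632 P69 Q1068752 S854 "http://www.nndb.com/people/215/000042089/"'
claim = parse_claim(line)
print(claim.subject, claim.prop, claim.value)
print(claim.references)
```

In the sample line, `P69` is the "educated at" property and `S854` attaches a reference URL as the source of the claim.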

Wrong Examples

Machine	Human	Comments
`Q21454578 P463 Q42482 S854 "http://www.metal-archives.com/artists/Hugh_Gilmour/84280"`	According to Encyclopædia Metallum, Hugh Gilmour was a member of Iron Maiden	possibly homonymous subject (incorrect resolution), incorrect classification
`Q28144 P101 Q1193470 S854 "http://www.museothyssen.org/en/thyssen/ficha_artista/301"`	According to the Thyssen-Bornemisza Museum, Willem Kalf's field of work is theme music	incorrect entity linking, incorrect classification
`Q3437676 P170 Q3908516 S854 "https://www.daao.org.au/bio/david-granger/"`	According to Design & Art Australia Online, David Granger is the creator of entrepreneurship	homonymous subject (incorrect resolution), incorrect classification

References Statistics

Domain	Confident	Supervised	Rule-based
adb.anu.edu.au	52,419	154,979	119,239
collection.britishmuseum.org	238,308	20,912	29,046
gameo.org	2,113	6,544	7,334
munksroll.rcplondon.ac.uk	4,114	18,438	12,649
archive.org	8,103	39,062	30,146
collection.cooperhewitt.org	2,383	11,550	13,677
sculpture.gla.ac.uk	1,663	1,474	1,182
dictionaryofarthistorians.org	1,358	3,620	4,969
en.wikisource.org	51,232	227,346	209,411
rkd.nl	44,690	N.A.	N.A.
structurae.net	1,851	N.A.	N.A.
vocab.getty.edu	213,436	6,137	4,052
www.bbc.co.uk	54,070	2,109	2,254
www.brown.edu	N.A.	1,200	1,144
www.daao.org.au	N.A.	26,848	21,256
www.genealogics.org	19,870	10,186	14,536
www.metal-archives.com	N.A.	760	1,796
www.museothyssen.org	1,468	1,498	2,096
www.newulsterbiography.co.uk	3,284	3,438	5,379
www.nndb.com	106,782	26,402	30,101
www.uni-stuttgart.de	20,627	N.A.	N.A.
www.wga.hu	9,762	5,088	5,944
yba.llgc.org.uk	4,645	6,912	9,599
Total	842,191	574,503	525,811
Grand total	1,942,505

Local Metrics

Metric	Achieved outcome	Explanation
1. Number of statements curated (approved + rejected) via the primary sources tool	127,072	It was not possible to measure the number of StrepHit-specific statements during the course of the project, since the final dataset (i.e., the Web Sources Knowledge Base) was expected at the end (cf. the last milestone in Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Timeline). We report instead the total number of curated statements, which we still believe to be a valid indicator of how StrepHit fostered the use of the primary sources tool.
2. Number of primary sources tool users	282 total, 10 of them also edited StrepHit data	It was not possible to fully assess this metric during the course of the project, for the same reason as the above one. The result shown was measured upon an unplanned bonus outcome, namely the Semi-structured Dataset (cf. Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint#Bonus_Milestone:_Semi-structured_Development_Dataset)
3. Number of involved data donors from Open Data organizations	1 explicit, 2 potential	ContentMine has expressed its interest in a data donation to Wikidata via the primary sources tool, as per Grants_talk:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#Support_from_ContentMine. We have also received informal expressions of interest from Openpolis (governmental data) and Till Sauerwein (biological data), which have not resulted in any written statement yet.
4. Wikidata request for comment process	wikidata:Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements	The discussion on the primary sources tool and its datasets is centralized.

Global Metrics

We are trying to understand the overall outcomes of the work being funded across all grantees. In addition to the measures of success for your specific program (in above section), please use the table below to let us know how your project contributed to the "Global Metrics." We know that not all projects will have results for each type of metric, so feel free to put "0" as often as necessary.

Next to each metric, list the actual numerical outcome achieved through this project.
Where necessary, explain the context behind your outcome. For example, if you were funded for a research project which resulted in 0 new images, your explanation might be "This project focused solely on participation and articles written/improved, the goal was not to collect images."

For more information and a sample, see Global Metrics.

Metric	Achieved outcome	Explanation
1. Number of active editors involved	282 total, 10 of them also edited StrepHit data	This global metric naturally maps to #2 of the local ones (cf. #Local_Metrics).
2. Number of new editors	unknown	Not sure how to measure this.
3. Number of individuals involved	80 (estimated)	The dissemination activities have led to the involvement of several individuals, ranging from seminar attendees (both physical and remote), to hackathon participants, all the way to software contributors. The reported outcome is a rough estimate.
4. Number of new images/media added to Wikimedia articles/pages	0	Not a goal of the project.
5. Number of articles added or improved on Wikimedia projects	2,609,407 Wikidata claims	This global metric naturally maps to the Web Sources Knowledge Base in-scope goal (cf. #Outcomes_as_per_stated_goals)
6. Absolute value of bytes added to or deleted from Wikimedia projects	N.A.	Most of StrepHit data will undergo a validation step via the primary sources tool before their eventual inclusion into Wikidata. Hence, this metric can only be measured after that, and does not represent a relevant indicator of the actual content modification of this project anyway. On the other hand, the confident subset of the data will be directly uploaded after approval by the community.

Learning question: Did your work increase the motivation of contributors, and how do you know?

Probably yes, but this is difficult to record. Besides the request for comment and the past endorsements, we report below the most prominent examples among the positive feedback collected so far:

Indicators of impact

Do you see any indication that your project has had impact towards Wikimedia's strategic priorities? We've provided 3 options below for the strategic priorities that IEG projects are mostly likely to impact. Select one or more that you think are relevant and share any measures of success you have that point to this impact. You might also consider any other kinds of impact you had not anticipated when you planned this project.

Improve quality

The trustworthiness of Wikidata is an essential requirement for high quality: this entails the addition of references to authenticate its content. However, a large number of assertions still lack references. StrepHit ultimately aims at enhancing the quality of Wikidata statements via references to third-party authoritative Web sources.

The system is designed to guarantee at least one reference per statement, and has done so for a total of 1,942,505 statements. Each of these statements carries a reference to a third-party source, which is a direct indicator of the quality improvement.

Increase participation

We spent considerable effort in requesting feedback on the primary sources tool and its datasets. This was mainly achieved through the hackathons and the work group at WikiCite 2016.

Increase reach

Outreach and dissemination activities for increasing the readership have been a high priority from the very beginning. We provide pointers to the full list at #Dissemination.

Project resources

Please provide links to all public, online documents and other artifacts that you created during the course of this project. Examples include: meeting notes, participant lists, photos or graphics uploaded to Wikimedia Commons, template messages sent to participants, wiki pages, social media (Facebook groups, Twitter accounts), datasets, surveys, questionnaires, code repositories... If possible, include a brief summary with each link.

Learning

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.

What worked well

What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.

What didn’t work

What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.

The lessons learnt reported in Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint#Challenges still apply;
in general, we faced two major unplanned tasks, which affected the overall schedule of the project:
1. construction of a suitable lexical database, since FrameNet failed to meet our needs;
2. second revision of the scientific article.

Both had a negative impact on the most delicate planned task, namely building the crowdsourced training set.

On top of these, we faced additional issues related to the crowdsourcing platform and to the nature of the input corpus, respectively:
- high execution time for certain lexical units that are not trivial to annotate (at the time of writing this report, some jobs are still running);
- high percentage of sentence chunks that cannot be labeled with any frame element (more than 50% on average), which resulted in a relatively large number of empty sentences even after the annotation.
This prevented us from reaching a sufficient number of training samples, thus causing a generally low performance of the supervised classifier, depending on the lexical unit;
finding a general-purpose method to serialize the classification results into Wikidata assertions was impossible, since we needed to understand the intended meaning of each Wikidata property, i.e., how it is used to represent the Wikidata world.

Next steps and opportunities

Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this grant that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.

StrepHit:
- extend language capabilities to high-coverage languages, namely Spanish, French, and Italian;
- improve the performance of the supervised extraction;
- fix open issues.
Primary sources tool:
- take control over the codebase;
- tackle usability issues;
- implement known feature requests.

Think your project needs renewed funding for another 6 months?

Part 2: The Grant

Finances

Actual spending

Please copy and paste the completed table from your project finances page. Check that you’ve listed the actual expenditures compared with what was originally planned. If there are differences between the planned and actual use of funds, please use the column provided to explain them.

As mentioned in Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint#Finances, we adjusted the dissemination budget item, due to the unexpectedly low cost of the planned activities. Due to the issues detailed in #What_didn't_work, we also adjusted the training set item.

Expense	Approved amount	Actual funds spent	Difference
Project Leader	$21,908
NLP Developer	$7,160
Training Set	$500
Dissemination	$432
Total	$30,000

Remaining funds

Do you have any unspent funds from the grant?

Please answer yes or no. If yes, list the amount you did not use and explain why.

No.

If you have unspent funds, they must be returned to WMF. Please see the instructions for returning unspent funds and indicate here if this is still in progress, or if this is already completed:

Documentation

Did you send documentation of all expenses paid with grant funds to grantsadmin wikimedia.org, according to the guidelines here?

Please answer yes or no. If no, include an explanation.

Yes.

Confirmation of project status

Did you comply with the requirements specified by WMF in the grant agreement?

Please answer yes or no.

Yes.

Is your project completed?

Please answer yes or no.

Regarding this 6-month scope, yes.

Grantee reflection

We’d love to hear any thoughts you have on what this project has meant to you, or how the experience of being an IEGrantee has gone overall. Is there something that surprised you, or that you particularly enjoyed, or that you’ll do differently going forward as a result of the IEG experience? Please share it here!

The grant really allowed us to get involved in the community. We believe the IEG program is a complete success with respect to the individual engagement process (and the program title is a perfect fit);
During the events we attended, we met lots of Wikimedians in person, and felt that we all share a huge amount of enthusiasm;
Quoting an earlier reflection: "the Wikimedian community seems to have a silent minority, instead of majority: when asking for feedback, we always received constructive answers". This is indeed a virtuous circle: during the round 1 2016 IEG call, Ester, a grantee candidate, approached us. We were just pleased to help her improve the proposal, as much as we got invaluable support when writing ours;
Thanks to Marti for the setup, we had the chance to have lunch with other IEGrantees at Wikimania 2016. It was just great to meet them, and we really felt part of a community.
We are extremely thankful to all the people who helped us during the course of the project. Making an exhaustive list would be impossible! Special thanks go to the Wikidata team, the Wikimedia research team, the IEG program officers, the IEG reviewers, and the endorsers.

Grants:IEG/StrepHit: Wikidata Statements Validation via References/Final
