Grants:IEG/StrepHit: Wikidata Statements Validation via References/Final

Welcome to this project's final report! This report shares the outcomes, impact and learnings from the Individual Engagement Grantee's 6-month project.

Part 1: The Project




In a few short sentences, give the main highlights of what happened with your project. Please include a few key outcomes or learnings from your project in bullet points, for readers who may not make it all the way through your report.

       Planned achievements, as per the project goals and timeline:
  1. Web Sources Production Corpus: 1.8 M items, 515 k documents (biographies), 53 reliable sources;
  2. Candidate Relations Set: 49 frames, 229 total frame elements, 133 unique frame elements, 69 unique Wikidata relations;
  3. StrepHit Pipeline Beta: v. 1.0 beta, v. 1.1 beta;
  4. Web Sources Knowledge Base: 842 k confident claims + 958 k supervised + 808 k rule-based = 2.6 M total claims;
  5. Primary Sources Tool: 5 merged pull requests, active request for comment.
       Bonus achievements, beyond the goals:
  1. Web Sources Corpus: +265 k (+106%) documents, +3 sources;
  2. Candidate Relations Set: +19 (+38%) Wikidata relations;
  3. Web Sources Knowledge Base: +359 k (+16%) Wikidata claims;
  4. Candidate Items dataset: a set of entities found in the corpus that could be added to Wikidata (needs validation);
  5. Wiki Loves Monuments Italy: a prototype dataset for Wikidata;
  6. Italian companies dataset: a proof-of-scalability dataset (another language, another domain), as a result of the HackAtoka hackathon.
       Codebase: 9,425 lines of Python code, 2 releases, 474 commits, 16 open issues, 37 closed issues.

Methods and activities


What did you do in your project?

Please list and describe the activities you've undertaken during this grant. Since you already told us about the setup and first 3 months of activities in your midpoint report, feel free to link back to those sections to give your readers the background, rather than repeating yourself here, and mostly focus on what's happened since your midpoint report in this section.

Main room for the hackathon held at the public library of Trento, Italy, during the Spaghetti Open Data Reunion 2016

The project has been managed as per Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint#Methods_and_activities.



As detailed in the planned outreach activities and in the April and May monthly reports, we conducted the following dissemination efforts after the midpoint.

Side Projects


Besides StrepHit, we have been contributing to the following projects:

WikiCite 2016 attendees

Outcomes and impact


Outcomes as per stated goals


What are the results of your project?

Please discuss the outcomes of your experiments or pilot, telling us what you created or changed (organized, built, grew, etc) as a result of your project.

The key planned outcomes of StrepHit are:

  • the Web Sources Corpus, composed of 1.8 M items circa gathered from 53 reliable Web sources;
  • the Natural Language Processing pipeline to extract Wikidata claims from free text;
  • the Web Sources Knowledge Base, composed of 2.6 M Wikidata claims circa.

Please use the table below to:

  1. List each of your original measures of success (your targets) from your project plan.
  2. List the actual outcome that was achieved.
  3. Explain how your outcome compares with the original target. Did you reach your targets? Why or why not?
Planned measure of success (include numeric target, if applicable) | Actual result | Explanation
Web Sources Production Corpus: 250 k documents, 50 sources | 1,778,149 items, 515,212 documents, 53 sources | +265,212 (+106%) documents, +3 sources
Candidate Relations Set: 50 Wikidata relations | 49 frames, 229 total frame elements, 133 unique frame elements, 69 unique Wikidata relations | +19 (+38%) Wikidata relations
StrepHit Pipeline Beta | releases: v. 1.0 beta, v. 1.1 beta | The first version is a working NLP pipeline. The second one contains improvements of the supervised classification system.
Web Sources Knowledge Base: 2.25 M Wikidata claims | 842,208 confident + 958,491 supervised + 808,708 rule-based = 2,609,407 total claims | +359,407 (+16%) Wikidata claims. Note that we picked a subset of the rule-based output, with confidence scores > 0.8. The whole output actually contains 2,822,538 claims, and we would have obtained a much larger knowledge base: 4,623,237 total claims, thus +2,373,237 (+105%). However, we decided to discard potentially low-quality ones.
Primary Sources Tool | 5 merged pull requests, active community discussion | On one hand, we have implemented the planned features. On the other, we have centralized the discussion on the tool usability and the available datasets.
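The Web Sources Knowledge Base figures in the table can be re-derived with a few lines of arithmetic. The following is a minimal sketch that only reuses the numbers reported above; the variable names and the `surplus` helper are illustrative and not part of the StrepHit codebase.

```python
from typing import Tuple

# Figures copied from the table above.
TARGET_CLAIMS = 2_250_000
confident = 842_208
supervised = 958_491
rule_based_kept = 808_708     # rule-based claims with confidence score > 0.8
rule_based_all = 2_822_538    # whole rule-based output


def surplus(actual: int, target: int) -> Tuple[int, int]:
    """Return the absolute surplus and the surplus as a rounded percentage of the target."""
    diff = actual - target
    return diff, round(100 * diff / target)


# Knowledge base as released: only the high-confidence rule-based subset.
released_total = confident + supervised + rule_based_kept
# Knowledge base that would result from keeping the whole rule-based output.
unfiltered_total = confident + supervised + rule_based_all

print(released_total, surplus(released_total, TARGET_CLAIMS))
print(unfiltered_total, surplus(unfiltered_total, TARGET_CLAIMS))
```

The first line printed corresponds to the released knowledge base, the second to the larger one that was discarded in favor of quality.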

Think back to your overall project goals. Do you feel you achieved your goals? Why or why not?

Yes. Not only did we achieve all the in-scope goals as per Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#In_Scope, but we also exceeded all the quantitative expectations.

Furthermore, we produced a set of bonus achievements.

Bonus Outcomes


Besides the planned goals, we reached the following bonus outcomes, in order of relevance to the Wikidata community:

  1. the unresolved entities dataset. When generating the Web Sources Knowledge Base, a (rather large) set of entities could not be resolved to Wikidata QIDs. They may serve as candidates for new Wikidata Items;
  2. the Wiki Loves Monuments for Wikidata prototype dataset. We were contacted by Wikimedia Italy to implement a very first integration of a WLM Italy dataset into Wikidata;
  3. a rule-based statement extraction technique, which does not require any training set, although it may yield less accurate extractions. It can be thought of as a trade-off between the text annotation and the statement validation costs;
  4. the Italian companies dataset, as a result of the HackAtoka hackathon. It is a proof of scalability for the StrepHit pipeline: the rule-based technique has been successfully applied to another domain (companies), in another language (Italian).

Classification Output

Amount of (1) sentences extracted from the input corpus, (2) classified sentences, and (3) generated Wikidata claims, with respect to confidence scores of linked entities
Performance values of the supervised classifier among a random sample of lexical units: (1) F1 scores via 10-fold cross validation, compared to a dummy classifier; (2) accuracy scores against a gold standard of 249 annotated sentences

Claim Correctness Evaluation


We carried out an empirical evaluation over the final output results, by randomly sampling 48 claims from the supervised and the rule-based datasets. Since StrepHit is a pipeline with several components, we computed the accuracy of those responsible for the actual generation of claims. Results indicate the ratio of correct data for each of them, as well as the overall claim correctness. The reader may refer to the BSc thesis [6] and the article [7] for full details of the system architecture.

Dataset Claims Linker Classifier Normalizer Resolver Overall
supervised 48 0.8125 0.781 1 0.285 0.638
rule-based 48 0.709 0.607 1 0.5 0.588

Sample Claims


Machine-readable ones are expressed in the QuickStatements syntax [8].

Correct Examples

Machine | Human
Q18526540 P569 +00000001815-02-24T00:00:00Z/11 S854 "" | According to the Australian Dictionary of Biography, Arthur Barkly was born on February 24, 1815
Q16058737 P106 Q80687 S854 "" | According to The Biographical Dictionary of America, Charles Millard Pratt has been a secretary
Q515632 P69 Q1068752 S854 "" | According to the Notable Names Database, Ossie Davis was educated at Howard University
Q18922309 P937 Q777039 S854 "" | According to the Royal College of Physicians, Henry Ashby has worked at Guy's Hospital
Q4861627 P19 Q739700 S854 "" | According to the BBC Your Paintings (now Art UK), Barnett Freedman was born in the East End of London
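To make the machine-readable column easier to read, here is a minimal Python sketch that splits one such statement into subject, property, value, and reference pairs. It assumes the whitespace-separated form printed above (the QuickStatements tool itself usually expects tab-separated input), and the names `Claim`, `parse_statement`, and `parse_time` are hypothetical helpers, not part of StrepHit or QuickStatements. The reference URL after S854 is empty in the examples as printed here, so the sketch simply keeps it as an empty string.

```python
import shlex
from typing import Dict, List, NamedTuple, Tuple


class Claim(NamedTuple):
    subject: str                        # Wikidata item, e.g. "Q515632"
    prop: str                           # Wikidata property, e.g. "P69"
    value: str                          # item ID or time literal
    references: List[Tuple[str, str]]   # (source property, value) pairs


def parse_statement(line: str) -> Claim:
    """Split one whitespace-separated statement into its parts."""
    tokens = shlex.split(line)          # keeps "" as an empty-string token
    # A statement is subject, property, value, then zero or more pairs.
    if len(tokens) < 3 or len(tokens) % 2 == 0:
        raise ValueError(f"malformed statement: {line!r}")
    subject, prop, value = tokens[:3]
    rest = tokens[3:]
    references = list(zip(rest[0::2], rest[1::2]))
    return Claim(subject, prop, value, references)


def parse_time(value: str) -> Dict[str, int]:
    """Read a time literal such as +00000001815-02-24T00:00:00Z/11.

    Assumption: only positive (CE) years are handled in this sketch.
    """
    timestamp, precision = value.split("/")
    date_part = timestamp.lstrip("+").split("T")[0]
    year, month, day = (int(part) for part in date_part.split("-"))
    # Precision 11 means the value is exact to the day in the Wikidata data model.
    return {"year": year, "month": month, "day": day, "precision": int(precision)}


claim = parse_statement('Q18526540 P569 +00000001815-02-24T00:00:00Z/11 S854 ""')
birth = parse_time(claim.value)
```

Applied to the first correct example, `claim.prop` is `P569` (date of birth) and `birth` holds the year 1815, month 2, day 24, with day-level precision, matching the human-readable rendering.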

Wrong Examples

Machine | Human | Comments
Q21454578 P463 Q42482 S854 "" | According to Encyclopædia Metallum, Hugh Gilmour was a member of the Iron Maiden | possibly homonymous subject (incorrect resolution), incorrect classification
Q28144 P101 Q1193470 S854 "" | According to the Thyssen-Bornemisza Museum, Willem Kalf's field of work is theme music | incorrect entity linking, incorrect classification
Q3437676 P170 Q3908516 S854 "" | According to Design & Art Australia Online, David Granger is the creator of entrepreneurship | homonymous subject (incorrect resolution), incorrect classification

References Statistics

Domain Confident Supervised Rule-based
(unnamed) 52,419 154,979 119,239
(unnamed) 238,308 20,912 29,046
(unnamed) 2,113 6,544 7,334
(unnamed) 4,114 18,438 12,649
(unnamed) 8,103 39,062 30,146
(unnamed) 2,383 11,550 13,677
(unnamed) 1,663 1,474 1,182
(unnamed) 1,358 3,620 4,969
(unnamed) 51,232 227,346 209,411
(unnamed) 44,690 N.A. N.A.
(unnamed) 1,851 N.A. N.A.
(unnamed) 213,436 6,137 4,052
(unnamed) 54,070 2,109 2,254
(unnamed) N.A. 1,200 1,144
(unnamed) N.A. 26,848 21,256
(unnamed) 19,870 10,186 14,536
(unnamed) N.A. 760 1,796
(unnamed) 1,468 1,498 2,096
(unnamed) 3,284 3,438 5,379
(unnamed) 106,782 26,402 30,101
(unnamed) 20,627 N.A. N.A.
(unnamed) 9,762 5,088 5,944
(unnamed) 4,645 6,912 9,599
Total 842,191 574,503 525,811
Grand total 1,942,505

Local Metrics

Metric | Achieved outcome | Explanation
1. Number of statements curated (approved + rejected) via the primary sources tool | 127,072 | It was not possible to measure the number of StrepHit-specific statements during the course of the project, since the final dataset (i.e., the Web Sources Knowledge Base) was expected at the end (cf. the last milestone in Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Timeline). We report instead the total number of curated statements, which we still believe to be a valid indicator of how StrepHit fostered the use of the primary sources tool.
2. Number of primary sources tool users | 282 total, 10 of them also edited StrepHit data | It was not possible to fully assess this metric during the course of the project, for the same reason as the one above. The result shown was measured upon an unplanned bonus outcome, namely the Semi-structured Dataset (cf. Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint#Bonus_Milestone:_Semi-structured_Development_Dataset).
3. Number of involved data donors from Open Data organizations | 1 explicit, 2 potential | ContentMine has expressed its interest in a data donation to Wikidata via the primary sources tool, as per Grants_talk:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#Support_from_ContentMine. We have also received informal expressions of interest from Openpolis (governmental data) and Till Sauerwein (biological data), which have not resulted in any written statement yet.
4. Wikidata request for comment process | wikidata:Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements | The discussion on the primary sources tool and its datasets is centralized.

Global Metrics


We are trying to understand the overall outcomes of the work being funded across all grantees. In addition to the measures of success for your specific program (in above section), please use the table below to let us know how your project contributed to the "Global Metrics." We know that not all projects will have results for each type of metric, so feel free to put "0" as often as necessary.

  1. Next to each metric, list the actual numerical outcome achieved through this project.
  2. Where necessary, explain the context behind your outcome. For example, if you were funded for a research project which resulted in 0 new images, your explanation might be "This project focused solely on participation and articles written/improved, the goal was not to collect images."

For more information and a sample, see Global Metrics.

Metric | Achieved outcome | Explanation
1. Number of active editors involved | 282 total, 10 of them also edited StrepHit data | This global metric naturally maps to #2 of the local ones (cf. #Local_Metrics).
2. Number of new editors | unknown | Not sure how to measure this.
3. Number of individuals involved | 80 (estimated) | The dissemination activities have led to the involvement of several individuals, ranging from seminar attendees (both physical and remote), to hackathon participants, all the way to software contributors. The reported outcome is a rough estimate.
4. Number of new images/media added to Wikimedia articles/pages | 0 | Not a goal of the project.
5. Number of articles added or improved on Wikimedia projects | 2,609,407 Wikidata claims | This global metric naturally maps to the Web Sources Knowledge Base in-scope goal (cf. #Outcomes_as_per_stated_goals).
6. Absolute value of bytes added to or deleted from Wikimedia projects | N.A. | Most of StrepHit data will undergo a validation step via the primary sources tool before their eventual inclusion into Wikidata. Hence, this metric can only be measured after that, and does not represent a relevant indicator of the actual content modification of this project anyway. On the other hand, the confident subset of the data will be directly uploaded after approval by the community.

Learning question
Did your work increase the motivation of contributors, and how do you know?

Probably yes, but this is difficult to record. Besides the request for comment and the past endorsements, we report below the most prominent examples among the positive feedback collected so far:

Indicators of impact


Do you see any indication that your project has had impact towards Wikimedia's strategic priorities? We've provided 3 options below for the strategic priorities that IEG projects are mostly likely to impact. Select one or more that you think are relevant and share any measures of success you have that point to this impact. You might also consider any other kinds of impact you had not anticipated when you planned this project.

Improve quality


The trustworthiness of Wikidata is an essential requirement for high quality: this entails the addition of references to authenticate its content. However, a large number of assertions still lack references. StrepHit ultimately aims at enhancing the quality of Wikidata statements via references to third-party authoritative Web sources.

The system is designed to guarantee at least one reference per statement, and has done so for a total of 1,942,505 statements. This directly indicates the quality improvement.

Increase participation


We believe we have spent considerable effort in requesting feedback on the primary sources tool and its dataset. This was mainly achieved through the hackathons and the work group at WikiCite 2016.

Increase reach


Outreach and dissemination activities for increasing the readership have been a high priority from the very beginning. We provide pointers to the full list at #Dissemination.

Project resources


Please provide links to all public, online documents and other artifacts that you created during the course of this project. Examples include: meeting notes, participant lists, photos or graphics uploaded to Wikimedia Commons, template messages sent to participants, wiki pages, social media (Facebook groups, Twitter accounts), datasets, surveys, questionnaires, code repositories... If possible, include a brief summary with each link.











The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.

What worked well


What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.

What didn’t work


What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.

Both had a negative impact on the most delicate planned task, namely building the crowdsourced training set.
  • On top of that, we faced additional issues related to the crowdsourcing platform and the nature of the input corpus. Respectively:
    • high execution time for certain lexical units that are not trivial to annotate (at the time of writing this report, some jobs are still running);
    • high percentage of sentence chunks that cannot be labeled with any frame element (more than 50% on average), which resulted in a relatively large amount of empty sentences even after the annotation.
  • This prevented us from reaching a sufficient amount of training samples, thus causing a generally low performance of the supervised classifier, depending on the lexical unit;
  • Finding a general-purpose method to serialize the classification results into Wikidata assertions was impossible, since we needed to understand the intended meaning of each Wikidata property, i.e., how it is used to represent the Wikidata world.

Next steps and opportunities


Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this grant that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.

  • StrepHit:
    • extend language capabilities to high-coverage languages, namely Spanish, French, and Italian;
    • improve the performance of the supervised extraction;
    • fix open issues.
  • Primary sources tool:
    • take control over the codebase;
    • tackle usability issues;
    • implement known feature requests.
Think your project needs renewed funding for another 6 months?

Part 2: The Grant




Actual spending


Please copy and paste the completed table from your project finances page. Check that you’ve listed the actual expenditures compared with what was originally planned. If there are differences between the planned and actual use of funds, please use the column provided to explain them.

As mentioned in Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint#Finances, we adjusted the dissemination budget item, due to the unexpectedly low cost of the planned activities. Due to the issues detailed in #What_didn't_work, we also adjusted the training set one.

Expense Approved amount Actual funds spent Difference
Project Leader $21,908
NLP Developer $7,160
Training Set $500
Dissemination $432
Total $30,000

Remaining funds


Do you have any unspent funds from the grant?

Please answer yes or no. If yes, list the amount you did not use and explain why.


If you have unspent funds, they must be returned to WMF. Please see the instructions for returning unspent funds and indicate here if this is still in progress, or if this is already completed:



Did you send documentation of all expenses paid with grant funds to grantsadmin, according to the guidelines here?

Please answer yes or no. If no, include an explanation.


Confirmation of project status


Did you comply with the requirements specified by WMF in the grant agreement?

Please answer yes or no.


Is your project completed?

Please answer yes or no.

Regarding this 6-month scope, yes.

Grantee reflection


We’d love to hear any thoughts you have on what this project has meant to you, or how the experience of being an IEGrantee has gone overall. Is there something that surprised you, or that you particularly enjoyed, or that you’ll do differently going forward as a result of the IEG experience? Please share it here!

  • The grant really allowed us to get involved in the community. We believe the IEG program is a complete success with respect to the individual engagement process (and the program title is a perfect fit);
  • During the events we attended, we met lots of Wikimedians in person, and felt that we all share a huge amount of enthusiasm;
  • Quoting an earlier reflection: "the Wikimedian community seems to have a silent minority, instead of majority: when asking for feedback, we always received constructive answers". This is indeed a virtuous circle: during the round 1 2016 IEG call, Ester, a grantee candidate, approached us. We were just pleased to help her improve the proposal, as much as we got invaluable support when writing ours;
  • Thanks to Marti for the setup, we had the chance to have lunch with other IEGrantees at Wikimania 2016. It was just great to meet them: we really felt part of a community;
  • We are extremely thankful to all the people who helped us during the course of the project. Making an exhaustive list would be impossible! Special thanks go to the Wikidata team, the Wikimedia research team, the IEG program officers, the IEG reviewers, and the endorsers.