Grants:IEG/StrepHit: Wikidata Statements Validation via References/Renewal/Final

Welcome to this project's final report! This report shares the outcomes, impact and learnings from the Individual Engagement Grantee's 12-month project.

Part 1: The Project




In a few short sentences, give the main highlights of what happened with your project. Please include a few key outcomes or learnings from your project in bullet points, for readers who may not make it all the way through your report.

       Planned outcomes, based on the scope and the timeline:
  1. Primary sources tool back end version 2: a Wikidata Query Service module;
  2. Primary sources tool front end version 2: a MediaWiki extension;
  3. StrepHit confident dataset version 2: 497,247 statements, 3,326,446 RDF triples;
  4. StrepHit supervised dataset version 2: 574,143 statements, 4,040,460 RDF triples.
       Extra outcomes, besides the scope:
  1. QuickStatements to Wikidata RDF converter: transform a community-specific format into a mature Web standard;
  2. Dataset booster: elevate simple reference URLs to fully structured references;
  3. Several contributions to Wiki pages.

Methods and activities


What did you do in your project?

Please list and describe the activities you've undertaken during this grant. Since you already told us about the setup and first 6 months of activities in your midpoint report, feel free to link back to those sections to give your readers the background, rather than repeating yourself here, and mostly focus on what's happened since your midpoint report in this section.

The project has been conducted in the same way as Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal/Midpoint#Methods_and_activities.

Software development


The team has almost fully focused on the release of the primary sources tool (PST from now on) version 2, which required work until the very end of the project.

Screenshot of the PST filter module


  • We still managed to publish version 2 of the StrepHit datasets, with refreshed URLs and fully structured references;
  • we started a discussion with Tpt and Lydia_Pintscher_(WMDE) to understand why the import of Freebase has stalled, and identified two main causes:
    1. the datasets have quality issues, with lots of unreferenced statements and blacklisted references;
    2. version 1 of the PST was pretty much unusable.

In this project, we tackled the latter, while the former is out of scope and remains open in the Related to Freebase column on the Phabricator workboard: phab:project/board/2788/.

Side projects


Besides the PST, we:

Outcomes and impact




What are the results of your project?

Please discuss the outcomes of your experiments or pilot, telling us what you created or changed (organized, built, grew, etc) as a result of your project.

Screenshot of the dataset ingestion special page, allowing data providers to upload or update their datasets



Please use the below table to:

  1. List each of your original measures of success (your targets) from your project plan.
  2. List the actual outcome that was achieved.
  3. Explain how your outcome compares with the original target. Did you reach your targets? Why or why not?
Planned measure of success (include numeric target, if applicable) | Actual result | Explanation
PST back end redesign | Back end version 2 beta | Component completely rewritten from scratch as a Wikidata Query Service module. We chose to fork such a central Wikidata project for self-sustainability and standardization purposes.
PST front end redesign | Front end version 2 beta | Component ported from a Wikidata gadget to a MediaWiki extension and partially rewritten. The same rationale as for the back end applies.
Make the primary sources list usable | Filter module | Module completely rewritten, with a workflow inspired by the previous version. Now a key part of the tool.
Developer community engagement | Code reviews | Contributed Gerrit patches and GitHub pull requests helped attract developers from outside the team.
Standard dataset release flow for third-party providers | Ingestion API | Data providers can now upload and update their datasets.
StrepHit datasets version 2 | Confident dataset + supervised dataset, version 2 | Datasets with fully structured, up-to-date references.
StrepHit lexical database version 2 | None | Out of time.
StrepHit direct inclusion dataset | None | Out of time.
StrepHit unresolved entities dataset | None | Out of time.

Screenshot of the PST filter module, displaying the entity of interest filter with autocompletion

Think back to your overall project goals. Do you feel you achieved your goals? Why or why not?

With respect to the PST part, yes. On the other hand, we only partially met the StrepHit goals, due to the unexpected workload that emerged during the development of version 2 of the tool. See the reasons in #What didn’t work.



The scheduled StrepHit tasks led to the development of additional projects sharing the same goal: standardizing the release flow for third-party data providers.

  1. the QuickStatements to Wikidata RDF converter retains support for the QuickStatements format (compact and widespread in the Wikidata community, albeit non-standard) while easing the burden of the RDF format (more complex, yet a mature Web standard);
  2. the dataset booster is designed to make statement references uniform and as rich as possible.
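To make the conversion idea concrete, here is a minimal illustrative sketch, not the project's actual converter: it assumes QuickStatements v1 lines of the form item, property, item-value separated by tabs, and emits only the "truthy" Wikidata RDF triples (the `wd:`/`wdt:` prefixes are the real Wikidata ones; the function name `qs_to_turtle` is hypothetical). A full converter must also handle string, date, and coordinate values, qualifiers, and sources.

```python
# Minimal sketch of a QuickStatements-to-RDF conversion (assumption:
# item-valued claims only, e.g. "Q42<TAB>P31<TAB>Q5"). Real QuickStatements
# input also carries literals, qualifiers, and sources.

PREFIXES = (
    "@prefix wd: <http://www.wikidata.org/entity/> .\n"
    "@prefix wdt: <http://www.wikidata.org/prop/direct/> .\n"
)

def qs_to_turtle(lines):
    """Convert simple QuickStatements claims to truthy Wikidata RDF (Turtle)."""
    triples = []
    for line in lines:
        subject, prop, value = line.strip().split("\t")
        triples.append(f"wd:{subject} wdt:{prop} wd:{value} .")
    return PREFIXES + "\n".join(triples) + "\n"

# Example: "Douglas Adams (Q42) is an instance of (P31) human (Q5)"
print(qs_to_turtle(["Q42\tP31\tQ5"]))
```

The appeal of the RDF side is that the output can be loaded by any standards-compliant triple store, whereas QuickStatements only works inside the Wikidata tool ecosystem.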

Local Metrics

Metric | Achieved outcome | Explanation
1. Number of statements curated (approved + rejected) through the PST | 394,870 | Version 2 is not deployed in Wikidata yet: the reported value refers to the previous version. We still view it as an impact measure of this project.
2. Engagement of open data organizations | None | We have not reached this phase, due to the unforeseen implementation effort required by the PST.
3. Number of PST users | 695 | Same reasons as metric 1.

Screenshot of the PST filter module, displaying the ready-to-use baked filters

Global Metrics


We are trying to understand the overall outcomes of the work being funded across all grantees. In addition to the measures of success for your specific program (in above section), please use the table below to let us know how your project contributed to the "Global Metrics." We know that not all projects will have results for each type of metric, so feel free to put "0" as often as necessary.

  1. Next to each metric, list the actual numerical outcome achieved through this project.
  2. Where necessary, explain the context behind your outcome. For example, if you were funded for a research project which resulted in 0 new images, your explanation might be "This project focused solely on participation and articles written/improved, the goal was not to collect images."

For more information and a sample, see Global Metrics.

Metric | Achieved outcome | Explanation
1. Number of active editors involved | 695 | This global metric corresponds to local metric 3.
2. Number of new editors | 268 | Difference between metric 1 and the total users reported in the proposal: Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal#cite_note-4.
3. Number of individuals involved | 150 (estimated) | Sum of the participants in the main dissemination events.
4. Number of new images/media added to Wikimedia articles/pages | 13 | Not a target of this project, but still measurable.
5. Number of articles added or improved on Wikimedia projects | 1,071,390 Wikidata statements | Sum of the StrepHit datasets version 2 statements. Not actually added to Wikidata until curation through the PST occurs.
6. Absolute value of bytes added to or deleted from Wikimedia projects | N.A. | Not measurable until curation through the PST occurs.

Screenshot of a MediaWiki instance sidebar with the PST extension enabled
Learning question
Did your work increase the motivation of contributors, and how do you know?

It is not easy to give an objective answer. We list below some relevant feedback, both positive and negative:

Screenshot of the PST filter module, instructions pop-up
Screenshot of the dataset selection window, also showing essential information on each dataset

Indicators of impact


Do you see any indication that your project has had impact towards Wikimedia's strategic priorities? We've provided 3 options below for the strategic priorities that IEG projects are mostly likely to impact. Select one or more that you think are relevant and share any measures of success you have that point to this impact. You might also consider any other kinds of impact you had not anticipated when you planned this project.

Improve quality
This has definitely been the central goal of StrepHit since the very beginning: the lack of references in Wikidata is a significant obstacle to its data quality. With version 2, we refined the StrepHit statements and enriched their references. At the time of writing, we do not have any specific indicator for them, because the new version of the PST is not yet deployed on Wikidata. Nevertheless, a total of 287,988 statements have been approved (i.e., included in Wikidata) so far: this is clearly a sign of impact.
Increase participation
The 268 new PST users gained since the StrepHit renewal proposal clearly indicate impact.
Increase reach
We expect that our efforts towards the flow standardization for third-party data providers will yield impact on this strategic priority.

Project resources


Please provide links to all public, online documents and other artifacts that you created during the course of this project. Examples include: meeting notes, participant lists, photos or graphics uploaded to Wikimedia Commons, template messages sent to participants, wiki pages, social media (Facebook groups, Twitter accounts), datasets, surveys, questionnaires, code repositories... If possible, include a brief summary with each link.

Wikidata primary sources tool






Contributions to Wiki pages


Side projects




The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.

What worked well


What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.

What didn’t work


What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.

The following set of issues related to the PST part had a highly negative effect on the whole project plan, preventing the team from effectively addressing the StrepHit one:

  1. the front end version 1 was designed in an inflexible way, which precluded both the implementation of some desired features and appropriate code refactoring;
  2. MediaWiki software development often follows peculiar practices, so it took us longer than expected to understand what to do and how;
  3. scattered, outdated, and sometimes redundant documentation made the task even harder;
  4. the development of a MediaWiki extension has a steep learning curve;
  5. the Wikidata Query Service is a big Java project and proved to be overkill for our purposes. More specifically:
    • relatively straightforward Web services demanded a lot of code;
    • the deployment had high memory requirements, ruling out a Toolforge machine.

Other recommendations


If you have additional recommendations or reflections that don’t fit into the above sections, please list them here.

Quoting a sentence from the midpoint report:

Next steps and opportunities


Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this grant that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.

As already mentioned, the team almost totally devoted its efforts to the PST. Therefore, a significant amount of work is still needed to take StrepHit to the next level, and can result in a new grant proposal. See the section below.

StrepHit datasets

  1. The entity reconciliation task was responsible for most errors in the final output, as highlighted in Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Final#Claim_Correctness_Evaluation (Resolver column). This will be a key activity of the upcoming soweego project, where we expect to develop methods that are likely to be useful for StrepHit as well;
  2. the knowledge representation task handles the conversion of facts extracted from natural language into Wikidata statements. It plays an essential role in the correctness of the final output. The lexical database behind this task still needs a rethink;
  3. the planned bot to import statements above a high confidence threshold has to be implemented;
  4. the dataset of unresolved entities has to be analyzed yet.

Primary sources tool

  • Open tasks: phab:project/board/2788/query/open/;
  • Technical details for the back end:
    • the Blazegraph data loader already performs RDF syntax checks;
    • for scalability, read RDF in streaming mode using the N-Triples serialization, instead of loading Turtle into memory;
    • add rdf:type triples to qualifiers and references, and investigate the trade-off between faster execution of queries and higher volume of datasets;
    • implement a way to automatically reindex the StrepHit corpus when reference URLs change;
  • Technical details for the front end:
    • optimize the query for the Entity of interest filter;
    • understand the optimal SPARQL query limit threshold to avoid empty result tables in the autocompletion filters;
    • enable the support for on-the-fly reference editing on suggested statements.
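The streaming point above rests on a property of N-Triples: each line is a self-contained triple, so a loader can process arbitrarily large dumps with constant memory, whereas Turtle's prefixes and multi-line syntax typically force whole-document parsing. A minimal sketch of the idea (the `stream_triples` helper and its loose IRI-only pattern are illustrative assumptions, not the project's loader; real N-Triples also allows literals and blank nodes):

```python
# Sketch of streaming N-Triples ingestion: one triple per line, parsed
# independently, so memory use stays constant regardless of dump size.
import re

# Very loose pattern for illustration: <s> <p> <o> .
# (IRIs only; real N-Triples also permits literals and blank nodes.)
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.$')

def stream_triples(lines):
    """Yield (subject, predicate, object) tuples one line at a time."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match:
            yield match.groups()

example = [
    '<http://www.wikidata.org/entity/Q42> '
    '<http://www.wikidata.org/prop/direct/P31> '
    '<http://www.wikidata.org/entity/Q5> .',
]
for s, p, o in stream_triples(example):
    print(s, p, o)
```

In practice, `lines` would be a lazily read file handle over a gzipped `.nt` dump, keeping only one triple in memory at a time.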

Freebase (third-party dataset)


Significant effort is required to slice this huge dataset into a suitable subset for the PST. See the Related to Freebase column in the Phabricator project page for more details: phab:project/board/2788/.

Part 2: The Grant




Actual spending


Please copy and paste the completed table from your project finances page. Check that you’ve listed the actual expenditures compared with what was originally planned. If there are differences between the planned and actual use of funds, please use the column provided to explain them.

Expense | Approved amount | Actual funds spent | Difference
Project Leader | 41,379 € | 43,260 € | Balancing based on the other items.
WikiCite 2017(1) | 1,196 € | 106 € | The rescheduled event took place near the grantee's physical location.
Wikimania 2017 | 2,079 € | 1,838 € | Flight costs were slightly lower than estimated.
Training Set | 550 € | 0 € | Not enough time for task S2.
Total | 45,204 € | 45,204 € |

(1) replaces Wikimedia Developer Summit 2017.

Remaining funds


Do you have any unspent funds from the grant?

Please answer yes or no. If yes, list the amount you did not use and explain why.




Did you send documentation of all expenses paid with grant funds to grantsadmin, according to the guidelines here?

Please answer yes or no. If no, include an explanation.


Confirmation of project status


Did you comply with the requirements specified by WMF in the grant agreement?

Please answer yes or no.


Is your project completed?

Please answer yes or no.

Yes, although some tasks are left open, particularly those tagged as Epic in the To do column: phab:project/board/2788/.

Grantee reflection


We’d love to hear any thoughts you have on what this project has meant to you, or how the experience of being an IEGrantee has gone overall. Is there something that surprised you, or that you particularly enjoyed, or that you’ll do differently going forward as a result of the IEG experience? Please share it here!

We would just like to thank all the great people from the Wikimedia world who helped us build the new version of the PST. The list is in chronological order and may not be exhaustive, so please forgive us if we missed you!

  • LZia_(WMF), research scientist at WMF, for her irreplaceable technical advice and precise reviews of this project;
  • Mjohnson_(WMF), program officer at WMF, for being a tower of strength throughout the grant, and for setting up the grantees luncheon at Wikimania 2017;
  • Jtud_(WMF), grants administrator at WMF, for the prompt, crystal-clear written communications;
  • Dario_(WMF), head of research at WMF, for organizing the key event WikiCite and for converging to a shared research vision;
  • Lydia_Pintscher_(WMDE) and the whole Wikidata team for their invaluable, ceaseless support;
  • Tpt, core developer of the gadget version and publisher of the Freebase datasets, for his keen guidance;
  • T_Arrow and the WikiFactMine team, for the crucial conversations;
  • Sjoerddebruin, power Wikidata user, for the regular interactions;
  • Smalyshev_(WMF), core developer of the mw:Wikidata_query_service, for his code reviews and the fruitful discussion at Wikimania 2017;
  • GLederrey_(WMF), operations engineer at WMF, for his priceless Java tips and best practices;
  • BDavis_(WMF), engineering manager at WMF, for his vital assistance on the Cloud VPS infrastructure.