Grants:Project/MFFUK/Wikidata & ETL

This project is funded by a Project Grant

status: selected

Project Grants

Wikidata & ETL

summary: This project aims at improving management and increasing automation of processes loading data into Wikidata.

target: Wikidata

type of grant: tools and software

amount: $39,687.50

nonprofit: Yes

grantee: Jakub Klímek, Škoda Petr

advisor: Vojtěch Dostál

contact: klimek@ksi.mff.cuni.cz, Jakub Klímek - klimek@ksi.ms.mff.cuni.cz

volunteer: Jiří Sedláček

organization: Faculty of Mathematics and Physics of the Charles University

created on: 07:08, 6 November 2018 (UTC)

Project idea

What is the problem you're trying to solve?

What problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project. Remember to review the tutorial for tips on how to answer this question.

Currently, Wikidata, or any other Wikibase instance, is being populated from external data sources mostly manually, by creating ad-hoc data transformation scripts. Usually, these scripts are run once, and that is it. Given the heterogeneity of the source data and languages used to transform them, this means the scripts are hard or impossible to maintain and unable to run periodically in an automated fashion to keep Wikidata up-to-date.

At the same time, more and more interesting open data sources emerge, ready to be ingested into Wikidata. Without a methodology, tooling support and a set of best practices and examples, they will be left unexploited, or they will be transformed in the same, chaotic way as before.

Existing approaches to programmatically loading data into Wikidata, such as Pywikibot and WikidataIntegrator, require coding in Python. QuickStatements is not suited for automatic bulk data loading, and OpenRefine is limited to tabular data and focuses more on manual tinkering with the data than on the bulk loading process. Wikibase_Universal_Bot is a Python library that automatically loads bulk data from a tabular format (CSV file) into a Wikibase instance (such as Wikidata), given a user-defined data model (in YAML format).

What is your solution to this problem?

For the problem you identified in the previous section, briefly describe how you would like to address this problem. We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing. Remember to review the tutorial for tips on how to answer this question.

We propose to use LinkedPipes ETL (extract-transform-load), an open-source ETL tool, as a platform for creating repeatable processes for bulk loading data into Wikidata and other Wikibase instances from various data sources. Further, we propose to create a Wikidata-specific methodology for using the tool in this manner. We propose to experimentally verify this approach by transforming a set of interesting open data sources and loading them into Wikidata using the methodology and the tool. This way, the data ingestion pipelines created by Wikipedians will have a consistent structure. The data sources will be clearly identified and localized, their transformations will be done using standard languages such as SPARQL or XSLT, and the resulting data will be loaded into Wikidata consistently. Part of the methodology will also deal with the problem of managing Wikidata Item IDs while bulk loading data, to help avoid duplicate items and duplicate statements about them.
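The item-ID management part of the methodology can be illustrated with a minimal sketch. It assumes that each source record carries a stable external identifier and that a lookup table from external identifiers to existing Wikidata item IDs has already been obtained (for example by a query). The function name, field names, and item IDs below are hypothetical and are not part of LinkedPipes ETL.

```python
from typing import Dict, List, Tuple


def plan_load(
    records: List[dict],
    known_items: Dict[str, str],
    id_field: str = "external_id",
) -> Tuple[List[Tuple[str, dict]], List[dict]]:
    """Split source records into updates of existing items and new items.

    known_items maps an external identifier to an existing item ID.
    Records whose identifier is already known are paired with that item ID;
    the remaining records are candidates for item creation.
    """
    updates: List[Tuple[str, dict]] = []
    creations: List[dict] = []
    seen = set()
    for record in records:
        ext_id = record[id_field]
        if ext_id in seen:
            # The same identifier appeared earlier in the source: skip the repeat
            # so that one run never creates the same item twice.
            continue
        seen.add(ext_id)
        if ext_id in known_items:
            updates.append((known_items[ext_id], record))
        else:
            creations.append(record)
    return updates, creations


# Made-up sample data; "Q100" is only a placeholder item ID.
records = [
    {"external_id": "A1", "label": "First"},
    {"external_id": "B2", "label": "Second"},
    {"external_id": "A1", "label": "First (repeated)"},
]
known = {"A1": "Q100"}
updates, creations = plan_load(records, known)
```

In this sketch the record with identifier `A1` is routed to an update of the already known item, `B2` becomes a creation candidate, and the repeated `A1` record is dropped.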

Given that the Wikidata data model is different from the RDF data model, on which LinkedPipes ETL currently focuses, new components and adjustments to the tool itself will have to be developed so that the tool is more usable for Wikidata.
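To make the difference concrete: a single Wikidata statement can carry qualifiers and references, so it does not fit into one RDF triple. The sketch below expands one statement into several triples, loosely following the prefix conventions of the Wikidata RDF export (`wd:`, `p:`, `ps:`, `pq:`, `pr:`, `prov:wasDerivedFrom`). The class, the function, the node identifiers, and the sample values are hypothetical, and ranks and full value nodes are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Statement:
    item: str                      # item ID, e.g. "Q1" (placeholder)
    prop: str                      # property ID
    value: str                     # simplified: the value as a plain string
    qualifiers: Dict[str, str] = field(default_factory=dict)
    reference_urls: List[str] = field(default_factory=list)


def to_triples(st: Statement, node_id: str) -> List[Triple]:
    """Expand one statement into triples around an intermediate statement node."""
    node = f"wds:{node_id}"
    triples: List[Triple] = [
        (f"wd:{st.item}", f"p:{st.prop}", node),   # item -> statement node
        (node, f"ps:{st.prop}", st.value),         # statement node -> main value
    ]
    for q_prop, q_value in st.qualifiers.items():
        triples.append((node, f"pq:{q_prop}", q_value))
    for i, url in enumerate(st.reference_urls):
        ref = f"wdref:{node_id}-ref{i}"            # hypothetical reference node ID
        triples.append((node, "prov:wasDerivedFrom", ref))
        triples.append((ref, "pr:P854", url))      # P854: reference URL
    return triples


# Made-up example: a population value (P1082) with a point-in-time qualifier (P585).
st = Statement(
    item="Q1",
    prop="P1082",
    value="1000",
    qualifiers={"P585": "2018-01-01"},
    reference_urls=["https://example.org/source"],
)
triples = to_triples(st, "S1")
```

One statement with one qualifier and one reference already yields five triples here, which is the kind of structure the new components would have to produce or consume.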

This tool was chosen as a solution because it is already in use for Linked Data publishing tasks in various institutions in academia, the Czech public administration, and partners of the OpenBudgets.eu H2020 project, and possibly more, as it is freely available on GitHub. It is also actively developed at the Charles University by the grantee, which proves we have the necessary know-how to develop the tool further.

The tool is intended to be installed on the laptops of individual users, or on non-WMF servers. Therefore, neither code nor security review is necessary.

Project goals

What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.

The first goal of the project is to demonstrate the repeatable Wikidata data ingestion process on several proof-of-concept use cases for different types of data, possibly improving the LinkedPipes ETL tool in the process.
The second goal of the project is to create a methodology (guide, tutorial) describing how volunteers can contribute to Wikidata content using this tool in a systematic and repeatable way, illustrated by the proof-of-concept transformations.

Project impact

How will you know if you have met your goals?

For each of your goals, we’d like you to answer the following questions:

During your project, what will you do to achieve this goal? (These are your outputs.)
Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)

For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (i.e. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.

Proof-of-concept data transformations
1. We will identify 3 different types of data to be ingested into Wikidata. For each type, we will identify at least 1 data source (at least 5 in total), for which a repeatable data transformation pipeline will be created, documented and published. The community will be consulted for feedback and at least 2 new volunteers will be working with the tool.
2. Once the project is over, these will serve as examples and as a base for tutorials on how to create more similar transformations. At the same time, unless the source data format changes, they can be repeatedly run to update Wikidata as the source is updated.
Methodology (tutorial, guide)
1. We will produce a set of documents describing how to use LinkedPipes ETL to enrich Wikidata with transformations of additional data sources.
2. Once the project is over, volunteers will be able to follow the tutorial to add more and more content to Wikidata in a systematic way - using LinkedPipes ETL pipelines.
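What a repeated run means in practice can be sketched as a comparison between the values currently stored and the values in the refreshed source, so that a rerun only touches what has changed. This is an illustrative sketch under our own assumptions about data shapes, not a description of how LinkedPipes ETL is implemented; the function name and the sample IDs are hypothetical.

```python
from typing import Dict, Set, Tuple

Key = Tuple[str, str]  # (item ID, property ID)


def diff_statements(
    current: Dict[Key, Set[str]],
    source: Dict[Key, Set[str]],
) -> Tuple[Dict[Key, Set[str]], Dict[Key, Set[str]]]:
    """Return (values to add, values to remove) per (item, property) pair."""
    to_add: Dict[Key, Set[str]] = {}
    to_remove: Dict[Key, Set[str]] = {}
    for key in current.keys() | source.keys():
        old = current.get(key, set())
        new = source.get(key, set())
        if new - old:
            to_add[key] = new - old
        if old - new:
            to_remove[key] = old - new
    return to_add, to_remove


# Made-up data: placeholder item and property IDs.
current = {("Q1", "P1"): {"a", "b"}}
source = {("Q1", "P1"): {"b", "c"}, ("Q2", "P1"): {"x"}}

to_add, to_remove = diff_statements(current, source)

# Running the comparison again after the load has been applied changes nothing.
second_add, second_remove = diff_statements(source, source)
```

The second call returning two empty dictionaries is the property that makes a pipeline safe to schedule repeatedly.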

Do you have any goals around participation or content?

Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable.

2 new authors (tech savvy users) of data transformation pipelines loading data into Wikidata created according to our methodology
5 new data sources loaded into Wikidata using LinkedPipes ETL

Project plan

Activities

Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow-up with people that are involved with your project?

The work will be split into work packages according to the types of work needed to be done.

Analysis and design

Description - The processes of bulk loading of data to Wikidata will be analyzed and the necessary implementation steps in LinkedPipes ETL will be specified.
Outputs - Document describing steps necessary to load data to Wikidata using LinkedPipes ETL and what is necessary to implement.

Implementation

Description - The necessary implementation in LinkedPipes ETL will be done.
Outputs - New components and core enhancements needed for loading data to Wikidata

Transformations

Description - The proof of concept data sources to be loaded into Wikidata will be identified, analyzed, transformed, and, if their nature allows, their regular updates scheduled and observed.
Outputs - LinkedPipes ETL pipelines for transformation and updates of individual data sources

Documentation

Description - The process of creation, monitoring and maintenance of data transformation and loading processes will be documented in the form of a user-focused tutorial
Outputs - Tutorial for volunteers wanting to load data to Wikidata illustrated by the proof of concept pipelines

Communication

Description - Promotion of project outputs, feedback gathering, presentation at Wikimania, tutoring of interested volunteers
Outputs - Blog posts, feedback report

Project management

Description - Reporting
Outputs - Progress reports, Midpoint report, Final report

WP/Month	1	2	3	4	5	6	7	8
WP1 - Analysis and design	X	X	X
WP2 - Implementation		X	X	X	X	X
WP3 - Transformations		X	X	X	X	X	X
WP4 - Documentation					X	X	X	X
WP5 - Communication				X	X	X	X	X
WP6 - Project management	X	X	X	X	X	X	X	X

Budget

How will you use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!

Item	Budget
Senior SW Developer (15 hours per week for 8 months)	$11,250.00
Senior Data scientist (15 hours per week for 8 months)	$14,000.00
Project manager (3 hours per week for 8 months)	$2,500.00
Server hosting	$0.00 (not needed)
Code/security review	$0.00 (not needed)
Travel (Wikimania 2019, 2 people)	$4,000.00
University overhead 20%	$7,937.50
Total	$39,687.50
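
The totals can be checked with a short calculation. The figures are consistent only if the 20% overhead is taken as a share of the total budget (equivalently, 25% on top of the direct costs); that reading is inferred from the numbers in the table rather than stated in the proposal.

```python
from decimal import Decimal

# Direct cost items as listed in the budget table.
direct_costs = {
    "Senior SW Developer": Decimal("11250.00"),
    "Senior Data scientist": Decimal("14000.00"),
    "Project manager": Decimal("2500.00"),
    "Server hosting": Decimal("0.00"),
    "Code/security review": Decimal("0.00"),
    "Travel (Wikimania 2019, 2 people)": Decimal("4000.00"),
}

direct_total = sum(direct_costs.values(), Decimal("0"))

# Assumption: overhead is 20% of the total, so direct costs are the other 80%.
overhead_share = Decimal("0.20")
total = direct_total / (1 - overhead_share)
overhead = total * overhead_share

print(f"Direct costs: ${direct_total:,.2f}")
print(f"Overhead:     ${overhead:,.2f}")
print(f"Total:        ${total:,.2f}")
```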

Community engagement

How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve at various points during your project? Community input and participation helps make projects successful.

Semi-regular updates on the Wikidata GLAM Facebook page (the Facebook page of choice for many power users in the field) and through Wikidata News
Blog post at LinkedPipes ETL website and on Wikimedia channels (aiming for Wikimedia blog post, if possible)
Workshop at Wikimania (14th-18th August 2019 @ Stockholm) - We aim to have a prototype to play with at Wikimania 2019

Get involved

Participants

Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.

Jakub Klímek (Data scientist and Project manager) - Linked Data enthusiast, data scientist, author of a multitude of data scraping and transformation pipelines, participant in multiple EU and regional projects, assistant professor at Charles University and the Czech Technical University in Prague, and a Linked and Open Data expert at the Ministry of the Interior of the Czech Republic.
Petr Škoda (Senior SW Developer) - fluent in Java, JavaScript and web technologies in general, main developer of LinkedPipes ETL (and its predecessor, UnifiedViews), Ph.D. student at Charles University
Vojtěch Dostál (volunteer) - coordinator of the Wikidata/Tech program of Wikimedia Czech Republic, chair of Wikimedia Czech Republic, involved in helping with grant proposal and consulting
Jiří Sedláček (volunteer) - community member of Wikidata, bot operator

Community notification

You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. Need notification tips?

Discussions during drafting of the proposal

E-mail discussions/phone calls with several advanced Wikidata editors
Call with the Wikidata team of Wikimedia Deutschland
Feedback on draft proposal from two other members of the Wikidata/GLAM community

Additional notifications

Information about the project sent to the OpenBudgets.eu mailing list
Wikidata + GLAM group on Facebook was notified

Endorsements

Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).

Support. I endorse this proposal. This work is designed to reuse existing tooling. These tools have been described in published work LinkedPipes ETL: Evolved Linked Data Preparation (Q57240474) and LinkedPipes Visualization: Simple Useful Linked Data Visualization Use Cases (Q57240476). It would be great to support additional workflows like those they propose. Their plan to create guidance for others will help us all learn from this work. YULdigitalpreservation (talk) 16:03, 13 November 2018 (UTC)
Support. I am strongly in favor of applying LP-ETL, as a mature state-of-the-art ETL tool, more intensely in support of Wikidata. I have excellent experience with the tool, both as Participant Coordinator in the H2020 OpenBudgets.eu project, and as Professor at University of Economics, Prague, where I regularly let the students do labs with the tool (processing government open data, in particular). The tool is versatile and powerful, yet user-friendly enough to be used even by users with limited programming background. Vsvatek
Support. I strongly endorse this project. It would help Wikidata remain a database which contains up-to-date data using as much automation as possible. I can imagine we could use this project's outcomes for adding current data from the TDKIV database (now connected to Wikidata using https://www.wikidata.org/wiki/Property:P5398) once the initial data set is included. (Our present work includes a lot of manual labor but it is not due to the unavailability of tools but rather due to the fact that only now we are adding precise statements serving as links from one item to another. When this initial phase is over, automation is highly welcome :-).) Linda.jansova (talk) 12:45, 20 November 2018 (UTC)
Support. I endorse this proposal. The project will make it easier to continuously ingest public sector open data and help make elected officials accountable to the public. Mordae (talk) 12:50, 20 November 2018 (UTC)
Support. I endorse the project. It improves Wikidata and supports cooperation between Wikimedia and academic field at the same time :) Gabriela Boková (WMCZ) (talk) 13:22, 20 November 2018 (UTC)
Support. Nothing like this has ever been introduced to Wikidata community. If successful, many data imports will be automated, which will even more justify the removal of data from Wikimedia templates in favor of Wikidata. Matěj Suchánek (talk) 14:24, 24 November 2018 (UTC)
Support. I endorse the project. Proper documentation is absolutely necessary. Repeatability can be a real game changer; at the moment only Mix'n'match has this feature (but very limited). Jklamo (talk) 13:32, 1 December 2018 (UTC)
Support This is a very tricky problem that needs a solution. I'm happy to see work put into figuring it out. --Lydia Pintscher (WMDE) (talk) 19:41, 30 December 2018 (UTC)
Support I have watched the LinkedPipes ETL screencast and it seems like a nice tool that covers an existing gap. --Micru (talk) 18:14, 31 December 2018 (UTC)
Support I consider it important that open data pertaining to central organizations, such as the Ministry of the Interior of the Czech Republic or the Czech Statistical Office, is already being processed through LP ETL. This means there are already valid references to use when communicating with undecided open data providers. A part of the whole package is the long-term sustainability of updated Wikidata, provided by automated imports. There is a lot of potential in that the application itself is modular. I see the development and use of LP ETL as the right approach in the overall strategy.
I would suggest adding more options to data import into other Wikiprojects. That would be useful for longer texts (more than 250 or 500 characters), or for partially unstructured data. (Considering importing data into both Wikidata and one of the sister wikiprojects.) --Laďa Nešněra (meta) (talk) 09:58, 12 January 2019 (UTC)
Support I definitely endorse this proposal. Outdated data are of little use, so Wikidata deeply needs regular data updates. Moreover, LinkedPipes ETL is already a well established project with a potential to grow. --YjM (talk) 17:56, 12 January 2019 (UTC)