Grants:Project/MFFUK/Wikidata & ETL/Midpoint

This project is funded by a Project Grant

Report accepted

This midpoint report for a Project Grant approved in FY 2018-19 has been reviewed and accepted by the Wikimedia Foundation.

To read the approved grant submission describing the plan for this project, please visit Grants:Project/MFFUK/Wikidata & ETL.
You may still review or add to the discussion about this report on its talk page.
You are welcome to email projectgrants@wikimedia.org at any time if you have questions or concerns about this report.

Welcome to this project's midpoint report! This report shares progress and learning from the grantee's first 3 months.

Summary

Poster for Wikimania 2019 representing the process of loading RDF data into Wikibases such as Wikidata

In a few short sentences or bullet points, give the main highlights of what happened with your project so far.

We have analyzed the problem of loading data into Wikibases and designed our approach.
We attended the Wikimedia Hackathon 2019, where we discussed our approach with the Wikidata community members present and received positive feedback.
We started the implementation of a Wikibase loader component for LinkedPipes ETL based on the Wikidata Toolkit library. The loader takes data in the Wikibase RDF dump format as input, and loads the data into the target Wikibase, creating and changing items as necessary.
Our poster representing the process of loading RDF data into Wikibases such as Wikidata was accepted for Wikimania 2019, and so was our workshop, where we will demo the approach.

Methods and activities

LinkedPipes ETL pipeline loading data into Wikidata (and dealing with Wikibase API)

Pipeline in LinkedPipes ETL simplified using the Wikibase uploader component

How have you set up your project, and what work has been completed so far?

Describe how you've set up your experiment or pilot, sharing your key focuses so far and including links to any background research or past learning that has guided your decisions. List and describe the activities you've undertaken as part of your project to this point.

The main idea of the project is this: if Wikidata (and other Wikibases) use the RDF dump format as the format in which querying happens, why not also use it as the format in which data can be loaded back into Wikidata (and other Wikibases)? We already had LinkedPipes ETL (LP-ETL) for publishing data as Linked Data (in RDF), with many use cases and a library of transformation components. Loading such data into Wikidata was a natural way to extend the project.

The way we went about it was:

We have set up our own Wikibase instance where we experiment with the Wikibase API, creating and updating items, properties, qualifiers, references and values.
First we tried logging in and importing some data using Postman to get to understand the API better.
Then we created our first LP-ETL pipeline loading some data about Czech Veteran trees into the instance (see the complex pipeline screenshot).
Based on discussions at the Wikimedia Hackathon 2019 and our own analysis of the situation, we started the implementation of the Wikibase loader component in LP-ETL, which allows us to simplify the data loading pipeline (shown in the second screenshot).
Currently, we are gradually adding support for more and more artefacts of the RDF dump format, so that it can also be used as an input format for Wikidata, not just a format for querying.
Also, by doing this, we are shielding the pipeline authors who know RDF and SPARQL from the complexities of the Wikibase JSON Model and API.
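To illustrate the kind of detail that pipeline authors are shielded from, here is a minimal Python sketch of the nested entity JSON that the Wikibase API expects when an item with string-valued statements is created by hand. It is an illustration only, not the loader's implementation (the actual component is built on Wikidata Toolkit). The function name `build_entity_json`, the property IDs `P1` and `P2`, and the sample values are hypothetical.

```python
import json
from collections import defaultdict


def build_entity_json(label, language, statements):
    """Build a Wikibase entity JSON document.

    `statements` is an iterable of (property_id, string_value) pairs.
    Only string-valued statements are handled in this sketch; qualifiers,
    references and other datatypes are deliberately left out.
    """
    claims = defaultdict(list)
    for property_id, value in statements:
        claims[property_id].append({
            "type": "statement",
            "rank": "normal",
            "mainsnak": {
                "snaktype": "value",
                "property": property_id,
                "datavalue": {"type": "string", "value": value},
            },
        })
    return {
        "labels": {language: {"language": language, "value": label}},
        "claims": dict(claims),
    }


# Hypothetical item: the property IDs and values are made up for illustration.
entity = build_entity_json(
    "Example veteran tree", "en", [("P1", "oak"), ("P2", "CZ-0001")]
)
print(json.dumps(entity, indent=2))
```

Even this reduced case needs three levels of nesting per statement, whereas the same information in RDF is one triple per statement.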

Midpoint outcomes

What are the results of your project or any experiments you’ve worked on so far?

Please discuss anything you have created or changed (organized, built, grown, etc) as a result of your project to date.

The main result so far is the Wikibase loader component, which can already be used to load data in a subset of the RDF dump format into a Wikibase. This can be tried out in the demo instance, although there is insufficient documentation right now.
LinkedPipes ETL was improved in the process: it gained HTTP-based browsing of debug data, which was missing before (browsing was implemented only over FTP, which is deprecated). This can also be tried out in the demo instance.
We use the data source for the Czech Veteran trees in our first proof-of-concept pipeline, which will soon start loading data into Wikidata. So far, it is loading a subset of data only into our Wikibase, so that we can do experiments and break things.

Finances

Please take some time to update the table in your project finances page. Check that you’ve listed all approved and actual expenditures as instructed. If there are differences between the planned and actual use of funds, please use the column provided there to explain them.

Then, answer the following question here: Have you spent your funds according to plan so far? Please briefly describe any major changes to budget or expenditures that you anticipate for the second half of your project.

We have spent our funds according to plan so far. We do not anticipate any changes for the second half of our project.

Learning

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.

What are the challenges

What challenges or obstacles have you encountered? What will you do differently going forward? Please list these as short bullet points.

The Wikibase JSON Model and API are not easy to use, and not likely to change any time soon. We therefore hope to shield further developers from it by supporting the RDF dump format as input format in our approach.
Determining a diff, i.e. comparing the data already in the Wikibase Query Service endpoint with the source data to find out what is the same and what differs, is not easy with SPARQL. We may have to consider a specialized component to help with this.
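The diff problem described above can be pictured with a small Python sketch. It assumes that both the source data and the data already in the Wikibase have been reduced to plain (item, property, value) tuples, which is a simplification: real statements also carry qualifiers, references and ranks. The function name `diff_statements` and the sample data are hypothetical, and this is not the specialized component mentioned above.

```python
def diff_statements(source, target):
    """Compare two collections of (item, property, value) tuples.

    Returns which statements must be added to the target, which are
    present only in the target, and which are already identical.
    """
    source, target = set(source), set(target)
    return {
        "to_add": sorted(source - target),
        "to_remove": sorted(target - source),
        "unchanged": sorted(source & target),
    }


# Hypothetical sample data: identifiers and values are made up.
source_data = [("tree-1", "P1", "oak"), ("tree-1", "P2", "CZ-0001"), ("tree-2", "P1", "lime")]
wikibase_data = [("tree-1", "P1", "oak"), ("tree-1", "P2", "CZ-0009")]

result = diff_statements(source_data, wikibase_data)
print(result)
```

Set operations make the comparison trivial once the data is in this shape. The difficulty noted above lies in getting there with SPARQL alone, since the target data has to be matched against source records that do not yet know their Wikibase identifiers.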

What is working well

What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your findings in the form of a link to a learning pattern.

Air travel - Added a note about kiwi.com and using aggregators for routing, and the airline's website for the actual purchase.
Git repository for software - We develop the Wikibase loader on GitHub.
Wikidata mass import - Added mention of LP-ETL and this project here.

Next steps and opportunities

What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points. If you're considering applying for a 6-month renewal of this grant at the end of your project, please also mention this here.

As we are finishing the implementation of the loader component, in the second half of the project we will focus on evaluating it on more proof-of-concept transformations, which will eventually load data into the production Wikidata instance.
We will also start writing documentation and tutorials and communicating examples to the community around Wikidata, so that we get some volunteers trying out our approach.
We are considering applying for a renewal of this grant, because:
- the Wikidata RDF format is already quite complex, and we foresee that we will not be able to cover 100% of its features in this project. We will probably not support features such as Lexemes, ranking, redirects and sitelinks, as we want to focus on properly supporting the rest. These features may be added in the project extension.
- the Wikibase model itself may also gain new features, such as new datatypes.
- we see opportunities to simplify the loading process even further by assisting the user in determining which items are new and which need to be updated, as this is currently left to the user and may not be trivial using SPARQL.

Grantee reflection

We’d love to hear any thoughts you have on how the experience of being a grantee has been so far. What is one thing that surprised you, or that you particularly enjoyed from the past 3 months?

My experience as a grantee has been great so far. I appreciate the ease and straightforwardness of the proposal process and the minimalistic approach to reporting, especially the focus on conciseness, which is hardly seen elsewhere.

What surprised me was the interest of participants of the Wikimedia Hackathon in the project and their willingness to help, which really encouraged us in our work.