Grants talk:Project/MFFUK/Wikidata & ETL
Interesting Project
This looks like a super interesting project -- and it looks like it will have a lot of implications for GLAM and other batch uploading projects. I am going to ping a good number of folks who likely will be interested in this grant: @SandraF (WMF), Jean-Frédéric, Beat Estermann, Alicia Fagerving (WMSE), Axel Pettersson (WMSE), and Multichill: would love your feedback. Astinson (WMF) (talk) 20:27, 19 November 2018 (UTC)
- This feels a bit like the GLAMwiki Toolset, but I see this is a local tool (thick client). It also makes me wonder what happened with GLAMpipe; Susannaanas, can you comment?
- @Astinson (WMF): The tool can be deployed anywhere, either locally or on a web server to be shared by multiple users. Also, the aim is to focus on actual items and the data about them rather than on uploading media to Commons. Jakub.klimek (talk) 09:13, 18 December 2018 (UTC)
- Let's say we spend this money and build the tool. Who are your users and who is going to import what? Lack of people actually using it really hurt the GWToolset.
- @Astinson (WMF): There is definitely a local community willing to use the tool for uploading data to Wikidata. Part of the project is spreading the word. Jakub.klimek (talk) 09:13, 18 December 2018 (UTC)
- What kind of development method are you using? Agile? Who are your customers going to be?
- @Astinson (WMF): So far, we have used agile. The tool has been under development for 3 years already, and there is a community using it for the production of RDF (linked) data. Some of those existing users might be interested in the possibility of preparing data for Wikidata and other Wikibases as well. Jakub.klimek (talk) 09:13, 18 December 2018 (UTC)
- Who is going to support the tool and its code after the project is over? For how long, and with what general time investment?
- @Astinson (WMF): The development and support of the tool has so far been funded from various research projects where it was used, e.g. the OpenBudgets.eu H2020 project, multiple Czech grant agencies, and 3 technical universities in Prague. We allocate portions of each of our research and/or application projects to the support and development of the tool, and we will try to keep it that way in our future endeavors as well.
- I don't mind spending money, but I would like to see lasting impact. Multichill (talk) 18:40, 20 November 2018 (UTC)
- @Multichill: I understand, please see the explanation above. Jakub.klimek (talk) 09:13, 18 December 2018 (UTC)
Feedback from Harej
Thank you for submitting this proposal. It's a great start and it looks like an interesting project. As a volunteer I've done a lot of work with large-scale importation of data into Wikidata, so I am particularly happy to see there is interest in making this work easier for others who want to do it.
I would recommend defining specifically what kind of users are expected to use this tool once it is developed. Are you targeting tech savvy users who want a convenient alternative to writing one-off Python scripts, or is this meant to be more user-friendly? What is the nature of the datasets being imported? Are you planning on supporting all kinds of datasets, or a narrower subset that is best suited for Wikidata? Either ahead of time or in the beginning stages of the project, you should be working with your potential users to figure out what you need to build and who you are building it for. Focus on doing a good job solving a few problems, rather than trying to make everyone happy.
- @Harej (WMF): I added a parenthesis to the project goals specifying that we focus on tech savvy users. The tool has an API which can later be used to create a user-friendly "wizard" layer on top of it. However, developing such a user-friendly frontend can only be done when the technical part underneath is finished. So this may be a topic for our next project. Jakub.klimek (talk) 09:33, 18 December 2018 (UTC)
Also, I would make sure you've done outreach to the Wikidata community, including through project chat. Keep in mind that your stakeholders not only include the direct users of the software, but people who are affected by the software as well, including administrators and patrollers who may have to clean up after it or block users who are misusing it.
- @Harej (WMF): Thanks, we will do that. Jakub.klimek (talk) 09:33, 18 December 2018 (UTC)
I also think that in addition to documentation, some proactive effort should be made to make the tool as easy to use as possible, reducing the number of steps needed to complete a given task.
- @Harej (WMF): I agree; however, this seems to be an optimization task which we can focus on once all the necessary technical steps are clearly identified and implemented. This is related to the user-friendliness mentioned above.
Reconciliation
In my experience, the bottleneck to imports is usually much more about reconciling the external data to pre-existing Wikidata Item IDs, and avoiding duplicates (either of items, or of existing statements on those items), rather than the much more mechanical process of actually then adding the data. I don't see anything in your proposal that discusses this area — perhaps you could talk a little bit more about your plans for that? --Oravrattas (talk) 11:27, 30 November 2018 (UTC)
- @Oravrattas: You are right, I added a sentence to the solution part about that. We are aware that this is going to be one of the challenges of the project, we just did not mention it explicitly. Defining the precise approach to this is actually part of the project, the rough idea is (coming from the Linked Data world) to keep a mapping of Wikidata Item IDs to URLs the objects/items would receive in the Linked Data world, probably stored as separate Wikidata statements. Jakub.klimek (talk) 09:26, 18 December 2018 (UTC)
- @Jakub.klimek: I'm glad this is on your radar :) Having some way to store the results of the reconciliation like this is certainly useful, but I fear that this is still sidestepping the initial issue of how that reconciliation happens in the first place. Is the goal to automate that in some way? Or will this, for example, only be useful in places where all the data for all relevant fields to be linked have already been run through something like Mix'n'Match? --Oravrattas (talk) 09:48, 18 December 2018 (UTC)
- @Oravrattas: The primary focus here is on data which has its single source of truth outside of Wikidata, using a process similar to Linked Data production. For data sources like these, I imagine the workflow to be something like this: 1) download and transform the source data into RDF, assigning non-Wikidata URLs to entities, Wikidata property IDs to properties, etc. 2) Query Wikidata for items from the same data source which already have those non-Wikidata URLs stored with them. 3) For existing ones, we have the existing Wikidata item IDs; for new ones, we just store the mapping: Wikidata assigns the item ID and it will be used the next time the update happens. The exact process is, however, part of solving the project. Jakub.klimek (talk) 10:51, 18 December 2018 (UTC)
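A minimal sketch (illustrative only, not from the proposal) of what step 2 of this workflow could look like: querying the Wikidata Query Service for items that already carry the external IRIs assigned in step 1. The property used to hold the external IRI is an assumption here; "exact match" (P2888) is shown purely as a plausible stand-in.

<syntaxhighlight lang="python">
# Hypothetical sketch of the "query Wikidata for existing mappings" step.
# The property storing the external IRI (P2888) is only an illustrative
# assumption; the actual property would be chosen during the project.
import requests

WDQS = "https://query.wikidata.org/sparql"

def find_existing_items(iri_prefix):
    """Return a {external IRI: Wikidata item IRI} mapping for items whose
    stored external IRI starts with the given prefix."""
    query = """
    SELECT ?item ?ext WHERE {
      ?item wdt:P2888 ?ext .
      FILTER(STRSTARTS(STR(?ext), "%s"))
    }
    """ % iri_prefix
    response = requests.get(
        WDQS,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "etl-reconciliation-sketch/0.1"},
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return {b["ext"]["value"]: b["item"]["value"] for b in bindings}

# Example: map entities from the Czech RUIAN register to existing Wikidata items.
mapping = find_existing_items("https://ruian.linked.opendata.cz/zdroj/obce/")
</syntaxhighlight>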
- @Jakub.klimek: I'm sorry, but I'm finding this a little too abstract and difficult to understand. Perhaps you could give a worked example of a source, and the sort of data that this tool might be able to extract from there and upload to Wikidata? This doesn't need to be a source that you would actually use, should the application be successful — anything that conveys this process you describe in a more concrete manner would be useful. --Oravrattas (talk) 20:01, 18 December 2018 (UTC)
- @Oravrattas: So far, the tool does ETL from anything to RDF. So let's say we start with an XML file describing the city of Prague (and all subordinate objects such as buildings, but this does not really matter). First, it creates its RDF representation using a data transformation pipeline manually created in the tool. The entity for the City of Prague then has an IRI such as https://ruian.linked.opendata.cz/zdroj/obce/554782. This is what is already working. The missing part is loading parts of this data into Wikidata. The design of the exact process of doing so is part of the project itself. We know there are some issues we will need to tackle. So one of the possibilities for what I now imagine the rest of the process will look like (and maybe during the project we will discover some issues) is: let's say I want to insert a statement into Wikidata that the city of Prague has the code 554782 in our national code list. If the representation of Prague in Wikidata already has the IRI (https://ruian.linked.opendata.cz/zdroj/obce/554782) stored with it (perhaps using a special property), the tool would query for it, discover the corresponding Wikidata item ID, and insert the statement there. If there is no Wikidata item corresponding to Prague, it will be created, with the link stored for future updates. The worst-case scenario (the one you initially talked about) is that there already is a Wikidata item representing Prague with no link specified. There are two options I can think of now: one, the link will have to be inserted using another process in advance (e.g. using the Silk link discovery framework via the Wikidata Query Service), or two, a duplicate item will be created and possibly reconciled later, which is obviously not the preferred solution. It should be noted that the primary aim of the project is not to solve all reconciliation issues (as this could easily be a topic for a separate, maybe consecutive, project). In addition, the reconciliation issues are not specific to Wikidata; they apply to every data integration task. The main focus of the project is to enable bulk loading of data into Wikidata in a consistent way (from the point of view of defining and managing the data transformations), preferably from authoritative sources for such data. Actually, if Wikidata had the mappings to the authoritative sources in the first place, the reconciliation issues would be significantly reduced. Of course, the resulting methodology will discuss the options for avoiding the creation of duplicates. Jakub.klimek (talk) 10:13, 19 December 2018 (UTC)
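To make the Prague example above more concrete, here is a rough sketch (again not part of the proposal) of the lookup-and-update step using Pywikibot. The property IDs are placeholders: EXT_IRI_PROP stands for whatever property ends up holding the external IRI, and CODE_PROP for a hypothetical property holding the national code-list value.

<syntaxhighlight lang="python">
# Illustrative only: add a code-list statement to the item found via the
# WDQS lookup, or create a new item carrying the external IRI when no
# mapping exists yet. Property IDs are placeholders, not real choices.
import pywikibot

EXT_IRI_PROP = "P2888"  # assumed property holding the external IRI
CODE_PROP = "P1234"     # hypothetical property for the national code list

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

def upsert_code(item_qid, ext_iri, code):
    """Write the national code to an existing item, or create a new item
    that also stores the external IRI for future updates."""
    if item_qid:  # an item was found via the WDQS lookup
        item = pywikibot.ItemPage(repo, item_qid)
    else:         # no mapping yet: create the item and store the link
        item = pywikibot.ItemPage(repo)
        item.editLabels(labels={"cs": "Praha"}, summary="Create item for source entity")
        link = pywikibot.Claim(repo, EXT_IRI_PROP)
        link.setTarget(ext_iri)
        item.addClaim(link, summary="Store external IRI for future reconciliation")
    claim = pywikibot.Claim(repo, CODE_PROP)
    claim.setTarget(code)
    item.addClaim(claim, summary="Import code from the national code list")

upsert_code(None, "https://ruian.linked.opendata.cz/zdroj/obce/554782", "554782")
</syntaxhighlight>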
- @Jakub.klimek: thank you for taking the time to spell this out. Is it the case, therefore, that the only data this tool will attempt to load will be "simple" values (e.g. number, string, or date), and not attempt to load anything that requires a link to another Item? Unfortunately I can't read Czech, so it's difficult for me to find suitable examples from that XML file, but if, for example your sources were to list the mayor of each city, would this tool need to skip those, or would it be able to update the P6 (head of government) record, and if so, how would Zdeněk Hřib (say) get reconciled to the relevant Wikidata item? I understand that you might not yet know how exactly you will implement all the details of the project, but as you say this is not only a critical element, but a sufficiently large problem that it could easily deserve a much bigger grant of its own, so it would be useful to know in advance what you think you would be able to do entirely through software (and roughly how you think that could work) and which parts would need human intervention (and at what stage, and by whom, this would happen) or would be treated as out of scope for this initial version. --Oravrattas (talk) 13:01, 20 December 2018 (UTC)
- @Oravrattas: In the case you describe, there are two things: the mapping of the source data to P6 (created manually in the transformation), and finding the Wikidata item ID for Zdeněk Hřib. There, I imagine the process will be similar to finding the original item ID for Prague: knowing the type of thing the source data links to, Wikidata will be queried for existing items with mappings, etc. For nonexistent mappings, again, an external linking tool will have to be used by the pipeline designer. We will show how this can be done and demonstrate it on sample datasets, but further reconciliation efforts will be out of scope for this project. Jakub.klimek (talk) 07:10, 21 December 2018 (UTC)
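For the item-valued case discussed here (P6 pointing at another item), a sketch under the same assumptions could look like the following; the mayor's QID is expected to come either from the external-IRI lookup shown earlier or from a separate linking step (e.g. Silk) when no mapping exists yet.

<syntaxhighlight lang="python">
# Sketch only: writing an item-valued statement once both sides have been
# reconciled to Wikidata QIDs. The QIDs would come from the lookup above or
# from an external linking tool; nothing here is prescribed by the proposal.
import pywikibot

repo = pywikibot.Site("wikidata", "wikidata").data_repository()

def add_head_of_government(city_qid, mayor_qid):
    """Link an already-reconciled mayor item to the city via P6."""
    city = pywikibot.ItemPage(repo, city_qid)
    claim = pywikibot.Claim(repo, "P6")  # head of government
    claim.setTarget(pywikibot.ItemPage(repo, mayor_qid))
    city.addClaim(claim, summary="Import head of government from source data")
</syntaxhighlight>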
Eligibility confirmed, round 2 2018
We've confirmed your proposal is eligible for round 2 2018 review. Please feel free to ask questions and make changes to this proposal as discussions continue during the community comments period, through January 2, 2019.
The Project Grant committee's formal review for round 2 2018 will occur January 3-January 28, 2019. Grantees will be announced March 1, 2019. See the schedule for more details.
Questions? Contact us.--I JethroBT (WMF) (talk) 03:23, 8 December 2018 (UTC)
A few comments/questions
Thank you for your proposal. However, I have a few comments/questions:
- As I understand it, users will run this software on their own computers. Does this mean that they will use their own Wikimedia accounts for adding data? I mean that there will be no central bot account which will do all the data additions?
- @Ruslik: There will definitely be no central bot, as this would be a bottleneck. Everyone interested in bulk loading data should be able to do so independently. Then it is up to the users which credentials they will use. This will be discussed in the resulting methodology. Jakub.klimek (talk) 07:18, 21 December 2018 (UTC)
- If so, then how will this be implemented in practice? I mean that ordinary users are subject to rate limits on their editing, and if this tool attempts to add information at too high a rate the edits will be rejected. Does this mean that users will need to create separate bot accounts in order to use this tool?
- @Ruslik: If there is such limitation, then yes, each user wanting to bulk load data will have to create their own bot account. Jakub.klimek (talk) 07:18, 21 December 2018 (UTC)
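For illustration, one way a per-user (bot) account could stay within such limits is to rely on Pywikibot's built-in throttling and maxlag handling; the numbers below are placeholders, not recommendations from the proposal.

<syntaxhighlight lang="python">
# Hedged example: conservative client-side throttling for bulk edits made
# under a user's own (bot) account. Values are illustrative only.
import pywikibot
from pywikibot import config

config.put_throttle = 10  # minimum seconds between write operations
config.maxlag = 5         # back off when server replication lag exceeds 5 s

site = pywikibot.Site("wikidata", "wikidata")
site.login()              # the user's own credentials, e.g. a personal bot account
repo = site.data_repository()
# Edits issued through `repo` are then paced by the settings above.
</syntaxhighlight>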
- Have you considered using the Primary Sources Tool for manually reviewing the data that your tool will be proposing for addition to Wikidata?
- @Ruslik: We have not. So far we imagine the datasets can be rather large; for instance, the list of all villages in the Czech Republic has about 6,500 items. For our use case, e.g. automatically updating numbers of inhabitants, the manual approval process would be too lengthy. Nevertheless, for smaller datasets this would definitely be an option, if the PST is indeed developed as described on its page, i.e. to support the import of RDF datasets, not only QuickStatements. Jakub.klimek (talk) 07:18, 21 December 2018 (UTC)
- In which language will the tool be written? How will it be distributed: as a binary file or as the source code or in some other way?
- @Ruslik: The tool is currently being written in Java and JavaScript, as can be seen on its GitHub page. Currently, it is being distributed as source code, but we can consider doing binary releases for selected platforms. Jakub.klimek (talk) 07:18, 21 December 2018 (UTC)
- Have you notified the Wikidata community (on wiki)? How are you going to interact with the Wikidata community when the data uploads begin?
- @Ruslik: Not on wiki. As stated in the Community notification section, we notified them on Facebook and personally contacted selected Wikidata editors who we know. Jakub.klimek (talk) 07:18, 21 December 2018 (UTC)
- As for interactions with the Wikidata community, we will start very small, make incremental improvements, and ask for feedback from the community at each step. --Vojtěch Dostál (talk) 08:27, 21 December 2018 (UTC)
- You wrote that "Given that the Wikidata data model is different from the RDF data model". In which way are they different? As far as I know, the Wikidata data model is also RDF-based.
- @Ruslik: Wikidata has an RDF export, which can be queried using the Wikidata Query Service. However, the data model (expressed in RDF) is more complex than simple RDF statements. It resembles reified RDF - where we have statements about statements (qualifiers of statements in Wikidata), which makes it a bit more complex to query. Jakub.klimek (talk) 07:18, 21 December 2018 (UTC)
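To illustrate the difference, here is a quick sketch (illustrative only) comparing a query over the simple "truthy" triples with one over the reified statement model that exposes qualifiers; wd:Q1085 is Prague, P6 is head of government, and P580 is the start-time qualifier.

<syntaxhighlight lang="python">
# Illustration of the two query shapes against the Wikidata Query Service.
# wdt: gives the direct (truthy) value; p:/ps:/pq: walk the reified statement
# node, which is where qualifiers such as start time (P580) live.
import requests

WDQS = "https://query.wikidata.org/sparql"

SIMPLE = "SELECT ?mayor WHERE { wd:Q1085 wdt:P6 ?mayor . }"

REIFIED = """
SELECT ?mayor ?start WHERE {
  wd:Q1085 p:P6 ?statement .
  ?statement ps:P6 ?mayor .
  OPTIONAL { ?statement pq:P580 ?start . }
}
"""

for query in (SIMPLE, REIFIED):
    r = requests.get(WDQS, params={"query": query, "format": "json"},
                     headers={"User-Agent": "wd-model-sketch/0.1"})
    r.raise_for_status()
    print(r.json()["results"]["bindings"])
</syntaxhighlight>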
- And finally, what is the opinion of the main Wikidata development team about your project?
- @Ruslik: We had a Skype call with Wikimedia Deutschland (as stated on the project page under community notification) and they support the effort. Jakub.klimek (talk) 07:18, 21 December 2018 (UTC)
Ruslik (talk) 20:55, 15 December 2018 (UTC)
- This seems useful. I think a good way to announce it would be the Wikidata project chat or the weekly news update (I added it to the latter). I'd avoid using Facebook as it isn't actually a WMF channel. Will try to get the software to work before endorsing. Jura1 (talk) 10:06, 31 December 2018 (UTC)
Aggregated feedback from the committee for MFFUK/Wikidata & ETL
Scoring rubric | Score
(A) Impact potential | 7.8
(B) Community engagement | 7.5
(C) Ability to execute | 7.5
(D) Measures of success | 7.0
Additional comments from the Committee:
This proposal has been recommended for due diligence review.
The Project Grants Committee has conducted a preliminary assessment of your proposal and recommended it for due diligence review. This means that a majority of the committee reviewers favorably assessed this proposal and have requested further investigation by Wikimedia Foundation staff.
Next steps:
- Aggregated comments from the committee are posted above. Note that these comments may vary, or even contradict each other, since they reflect the conclusions of multiple individual committee members who independently reviewed this proposal. We recommend that you review all the feedback and post any responses, clarifications or questions on this talk page.
- Following due diligence review, a final funding decision will be announced on March 1st, 2019.
I JethroBT (WMF) (talk) 19:37, 6 February 2019 (UTC)
Round 2 2018 decision
Congratulations! Your proposal has been selected for a Project Grant.
WMF has approved partial funding for this project, in accordance with the committee's recommendation. This project is funded with $39,687.50 USD.
Comments regarding this decision:
The Project Grants Committee supports partial funding of this proposal at roughly half the level of the original request. There is clear support for this work to automate data imports that would help keep information on Wikidata current. However, concerns regarding the current state of the tool after multiple years of development, and unclear community engagement plans to encourage adoption of the tool, prevented the committee from supporting full funding of the proposal.
Next steps:
- You will be contacted to sign a grant agreement and set up a monthly check-in schedule.
- Review the information for grantees.
- Use the new buttons on your original proposal to create your project pages.
- Start work on your project!
Upcoming changes to Wikimedia Foundation Grants
Over the last year, the Wikimedia Foundation has been undergoing a community consultation process to launch a new grants strategy. Our proposed programs are posted on Meta here: Grants Strategy Relaunch 2020-2021. If you have suggestions about how we can improve our programs in the future, you can find information about how to give feedback here: Get involved. We will launch our new programs in July 2021. If you are interested in submitting future proposals for funding, stay tuned to learn more about our future programs.
Alex Wang (WMF) (talk) 17:12, 1 March 2019 (UTC)
OpenRefine
OpenRefine now has a direct interface to submit changes to Wikidata and has been used for rather large-scale editing, so I don't quite understand the comment on the page. Nemo 18:20, 2 March 2019 (UTC)
Results
The proof of the pudding is in the eating. Do I understand correctly that so far the main concrete result on Wikidata itself was d:Wikidata:Requests for permissions/Bot/LinkedPipes ETL Bot MN, an import with 84 edits? Nemo 19:45, 22 February 2020 (UTC)
- @Nemo bis: Sorry for the late reply; for some reason I did not get notified about the edit, and it has been quite some time since the project ended. To answer your question: no, that is just one of the results (a bot created by a volunteer). All the bots used within the project are listed here in the report. They can be run at any time on request; they are not scheduled to run periodically at the moment. Jakub Klímek (talk) 15:25, 6 March 2020 (UTC)