soweego is a machine learning system that connects Wikidata to large-scale third-party catalogs.
It takes a set of Wikidata items and a given catalog as input, and links them through record linkage techniques based on supervised learning.
The main output is a dataset of Wikidata and third-party catalog identifier pairs.
The story so far
- the soweego Wikidata bot uploaded hundreds of thousands of confident links;
- medium-confident ones are in Mix'n'match (Q28054658) for curation;
- there is strong community support;
- outcomes fit the Wikidata development roadmap.
We see two principal growth directions:
- the validator component;
- addition of new third-party catalogs.
The former was outside the scope of the initial Project Grant. Nevertheless, we developed a prototype to address a key suggestion from the Wikimedia research team: "Develop a plan for syncing when external database and Wikidata diverge". The latter is a natural way to expand beyond the initial use case, thus increasing impact through more extensive coverage. Contributor engagement will be crucial for this task.
Why: the problem
soweego complements the Wikidata development roadmap with respect to its "Increase data quality and trust" part.
It aims to address three open challenges, listed from the most general to the most specific:
- missing feedback loops between Wikidata data donors and re-users;
- lack of methodical efforts to keep Wikidata in sync with third-party catalogs/databases;
- under-usage of the statement ranking system, with few bots performing most of the edits.
These challenges are intertwined: synchronizing Wikidata to a given external database is a precondition for enabling a feedback loop between both communities. At the same time, sync results can affect how the ranking system is used.
How: the solution
We synchronize Wikidata to a given target catalog at a given point in time through a set of validation criteria:
1. existence: whether a target identifier found in a given Wikidata item is still available in the target catalog;
2. links: to what extent all URLs available in a Wikidata item overlap with those in the corresponding target catalog entry;
3. metadata: to what extent relevant statements available in a Wikidata item overlap with those in the corresponding target catalog entry.
The application of these criteria to our running example translates into the following actions.
- Elvis Presley (Q303) has a MusicBrainz identifier 01809552, which does not exist in MusicBrainz anymore.
Action = mark the identifier statement with a deprecated rank;
- Elvis Presley (Q303) has 7 URLs, MusicBrainz 01809552 has 8 URLs, and 3 overlap.
Action = add 5 URLs from MusicBrainz to Elvis Presley (Q303) and submit 4 URLs from Wikidata to the MusicBrainz community;
- Wikidata states that Elvis Presley (Q303) was born on January 8, 1935 in Tupelo, while MusicBrainz states that 01809552 was born in 1934 in Memphis.
Action = add 2 referenced statements with MusicBrainz values to Elvis Presley (Q303) and notify 2 Wikidata values to the MusicBrainz community.
In case of either full or no overlap in criteria 2 and 3, the Wikidata identifier statement should be marked with a preferred or a deprecated rank respectively. Note that community discussion is essential to refine these criteria.
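The criteria and rank rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the soweego validator code: the function and class names are hypothetical, "full overlap" is taken to mean that the two value sets are equal, and empty sets are assumed to trigger no rank change. The URLs and values are made up, in line with the fictitious running example.

```python
from dataclasses import dataclass
from typing import Optional


def existence_rank(target_id: str, live_catalog_ids: set) -> Optional[str]:
    """Criterion 1: suggest a deprecated rank when the identifier is gone."""
    return None if target_id in live_catalog_ids else "deprecated"


@dataclass
class OverlapResult:
    to_wikidata: set      # values found only in the target catalog entry
    to_catalog: set       # values found only in the Wikidata item
    rank: Optional[str]   # suggested rank for the identifier statement, if any


def overlap_check(wikidata_values: set, catalog_values: set) -> OverlapResult:
    """Criteria 2 and 3: compare two sets of URLs or statement values."""
    shared = wikidata_values & catalog_values
    if not wikidata_values or not catalog_values:
        rank = None          # assumption: nothing to compare, no rank change
    elif wikidata_values == catalog_values:
        rank = "preferred"   # full overlap
    elif not shared:
        rank = "deprecated"  # no overlap
    else:
        rank = None          # partial overlap: exchange values, keep the rank
    return OverlapResult(catalog_values - shared, wikidata_values - shared, rank)


# Criterion 1: the identifier is no longer in the (made-up) catalog snapshot.
print(existence_rank("01809552", {"some-other-id"}))

# Criterion 2: 7 Wikidata URLs, 8 catalog URLs, 3 shared (made-up URLs).
shared_urls = {f"https://example.org/shared/{i}" for i in range(3)}
wikidata_urls = shared_urls | {f"https://example.org/wikidata-only/{i}" for i in range(4)}
catalog_urls = shared_urls | {f"https://example.org/catalog-only/{i}" for i in range(5)}
links = overlap_check(wikidata_urls, catalog_urls)
print(len(links.to_wikidata), len(links.to_catalog), links.rank)

# Criterion 3: birth date and place disagree entirely.
wikidata_facts = {("date of birth", "1935-01-08"), ("place of birth", "Tupelo")}
catalog_facts = {("date of birth", "1934"), ("place of birth", "Memphis")}
metadata = overlap_check(wikidata_facts, catalog_facts)
print(len(metadata.to_wikidata), len(metadata.to_catalog), metadata.rank)
```

With the example figures, the links check yields 5 URLs to add to Wikidata and 4 to submit to the catalog community, with no rank change because the overlap is partial. The metadata check yields 2 values in each direction and suggests a deprecated rank because nothing overlaps.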
What: things done
soweego 1 has an experimental validator module that implements the aforementioned criteria.
If you know how to use a command line and feel audacious, you can install soweego, import a target catalog, and try the validator out.
Besides that, major contributions in terms of content addition are:
- G1: take the soweego validator component from experimental to stable;
- G2: submit validation results to the target catalog providers;
- G3: engage the Wikidata community via effective communication of
- G4: expand soweego coverage to additional target catalogs.
- O1: production-ready validator module, implementing criteria acknowledged by the community;
- O2: datasets to enable feedback loops on the Wikidata users side, namely
  - automatically ranked identifier dataset, as a result of validation criteria actions;
  - entity enrichment statement dataset, based on available target data;
- O3: datasets to enable feedback loops on the target catalog providers side, namely rotten URLs and additional URLs & values;
- O4: engagement tools for Wikidata users, including visualization of soweego datasets and data curation tutorials;
- O5: procedure to plug a new target catalog that minimizes programming efforts.
- Feedback loop between target data donors and Wikidata users: as a target catalog consumer, soweego can shift the target maintenance burden. Edits performed by the soweego bot (properly communicated through visualization, for instance) close the loop from the Wikidata users side. A use case that emerged during the development of soweego 1 is the sanity check over target URLs: this yielded a rotten URLs dataset, which can be submitted to the target community;
- checks against target catalogs, which entail the enrichment of Wikidata items upon available data, especially relationships among entries. Use cases encountered in soweego 1 target catalogs include:
- automatic ranking of statements: a potentially huge impact on the Wikidata truthy statements dumps, which are both widely used as the "official" ones and underpin a vast portion of the Query Service.
The following numerical metrics are projections over all target catalogs currently supported by soweego and all experimental validation criteria, based on estimates for a single catalog (i.e., Discogs (Q504063)) and a single criterion.
Note that the actual amount of total statements will depend on data available in Wikidata and in target catalogs.
- Validator datasets (O2):
- 250k ranked statements. Estimate: 21k identifier statements to be deprecated, as a result of criterion 1;
- 120k new statements. Estimate: 10k statements to be added or referenced, as a result of criterion 2;
- 440k rotten URLs. Estimate: 110k URLs, as a result of the sanity check at import time;
- 128k extra values. Estimate: 16k values, as a result of criterion 3.
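The relationship between each single-catalog estimate and its projection can be checked with a short calculation. The sketch below only derives the ratios implied by the figures listed above. The proposal does not state the multipliers, so they are not explained here, and the variable names are illustrative.

```python
# (projection, single-catalog single-criterion estimate), both in thousands,
# copied from the list above.
metrics = {
    "ranked statements": (250, 21),
    "new statements": (120, 10),
    "rotten URLs": (440, 110),
    "extra values": (128, 16),
}

# Ratio implied by each pair of figures, rounded to one decimal place.
implied_factors = {
    name: round(projection / estimate, 1)
    for name, (projection, estimate) in metrics.items()
}

# Statements that would land on Wikidata: ranked plus new ones.
wikidata_statements = metrics["ranked statements"][0] + metrics["new statements"][0]

for name, factor in implied_factors.items():
    print(f"{name}: x{factor}")
print(f"Wikidata statements: {wikidata_statements}k")
```

The sum of ranked and new statements comes to 370k, which appears to match the "content pages created or improved" figure among the shared metrics.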
From a qualitative perspective, a project-specific request for comment will act as a survey to collect feedback.
The 3 shared metrics are as follows.
- 50 total participants: sum of target catalog community fellows, Wikidata users facilitating the feedback loop, and contributors to the
- 25 newly registered users: to be gathered from the Mix'n'match (Q28054658) user base;
- 370k content pages created or improved: sum of Wikidata statements edited by the
|M2||Feedback loop, data providers side||G2||3-12||25%|
|M3||Feedback loop, data users side||G3||3-12||25%|
Note that some milestones overlap in terms of timespan: this is required to exploit mutual interactions among them.
|M1.1||Validation criteria||Refine criteria through community discussion||35%|
|M1.2||Automatic ranking||Submit validation criteria actions on Wikidata statement ranks||20%|
|M1.3||Automatic enrichment||Leverage target catalog relationships to generate Wikidata statements||20%|
|M1.4||Interaction with constraints check||Intercept reports of this Wikibase extension to improve
|M2.1||Rotten URLs||Contribute rotten URLs to target catalog providers||30%|
|M2.2||Extra URLs & content||Submit additional URLs (criterion 2) and values (criterion 3) from Wikidata to target catalog providers||70%|
||Improve communication of Wikidata edits made by
|M3.2||Data curation guidelines||Explain how to curate Wikidata and Mix’n’match contributions made by
|M3.3||Real-world evaluation||Switch from in vitro to in situ evaluation of the
|M4.1||New domains||Extend support of target catalogs beyond people and works (initial use case)||20%|
|M4.2||Codebase generalization||Minimize domain-dependent logic||30%|
|M4.3||Simple import||Reduce the effort needed to add support for a new catalog||50%|
The total amount requested is 80,318 €.
|Project lead||Responsible for the full project implementation||Full time (40 hrs/week), 12 PM||52,735 €|
|Core system architect||Technical operations head||Full time, 12 PM||12,253 €|
|Research assistant||In charge of the applied machine learning components||Part time (20 hrs/week), 6 PM||14,330 €|
|Dissemination||Expenses to attend 2 relevant community events||One shot||1,000 €|
Gross salaries of the human resources are computed upon average estimates based on roles and locations. Gross labor rates per hour follow.
Note that the project lead will serve as the main grantee and will appropriately allocate the funding.
We identify the following relevant communities, which are both inside and outside the Wikimedia landscape.
- Wikidata development team;
- Mix'n'match (Q28054658) users;
- Target catalog owners:
Besides the currently supported target catalogs, we would like to engage Royal Botanic Gardens, Kew (Q18748726) through a formal collaboration with the Intelligent Data Analysis team.
The organization maintains a set of biodiversity catalogs, such as Index Fungorum (Q1860469).
The team is exploring directions to leverage and disseminate their collections.
Moreover, this would let soweego scale up to a totally different domain.
Hence, we believe the collaboration would yield a threefold benefit: for Wikidata, Kew, and soweego.
- see for instance https://tools.wmflabs.org/mix-n-match/#/group/Music
- see more in #What:_things_done
- see third main bullet point in Grants_talk:Project/Hjfocs/soweego#Round_2_2017_decision
- 4 catalogs, see Grants:Project/Hjfocs/soweego/Timeline#August_2018:_big_fishes_selection
- see the feedback loops with data re-users block in https://eu.roadmunk.com/publish/f247f2e06ee338c9997893bd1f9a696fbf2a40ed
- see the checks against 3rd party databases block in https://eu.roadmunk.com/publish/f247f2e06ee338c9997893bd1f9a696fbf2a40ed
- see for instance d:Special:Contributions/PreferentialBot
- the examples are fictitious and do not reflect the actual data
- plenty of queries use the truthy prefix wdt, see for instance d:Wikidata:SPARQL_query_service/queries/examples#Showcase_Queries
- see #How:_the_solution
- person months
- corresponds to 25% of the salary. The rest is funded by the hosting university
- corresponds to 50% of the salary. The rest is funded by the hosting university
- see #Participants
- Marco Fossati
- Hjfocs is a research scientist with a double background in natural languages and information technology. He holds a PhD in computer science from the University of Trento (Q930528).
- His profile is highly hybrid and can be defined as an enthusiastic leader of applied research in natural language processing for the Web, backed by a strong dissemination attitude and an innate passion for open knowledge, all blended with software engineering skills and a deep affection for human languages.
- He is currently focusing on Wikidata data quality and has been leading soweego since the very first prototype, as well as the StrepHit project, both funded by the Wikimedia Foundation.
- Emilio Dorigatti
- Edorigatti is a PhD candidate at the Ludwig Maximilian University of Munich (Q55044), where he is applying machine learning methods to the design of personalized vaccines for HIV and cancer, in collaboration with the Helmholtz Zentrum München (Q878592).
- He holds a BSc in computer science and a double MSc in data science, with a minor in innovation and entrepreneurship. He also has several years of working experience as a software developer, data engineer, and data scientist. He is mostly interested in Bayesian inference, uncertainty quantification, and Bayesian Deep Learning.
- Emilio was a core team member of the StrepHit project.
- Massimo Frasson
- MaxFrax96 has been a professional software engineer for more than 5 years. He has mainly worked as an iOS developer and game developer at Belka.
- He has always had a strong interest in software architecture and algorithm performance. He is currently pursuing an MSc in computer science at the University of Milan (Q46210) to deepen his knowledge of data science.
- Massimo has been involved in soweego since the early stages, and has made key contributions to its development.
- Advisor Volunteering my experience from Mix'n'match, amongst others Magnus Manske (talk) 10:47, 17 February 2020 (UTC)
- Volunteer I'm part of OpenMLOL, which has a a property in Wikidata (https://www.wikidata.org/wiki/Property:P3762) as the identifier of an author in openMLOL. I'm trying to standardize our author ID with the Wikidata ID. .laramar. (talk) 16:18, 20 February 2020 (UTC)
The links below reference notifications to relevant mailing lists and Wiki pages. They are sorted in descending order of specificity.
- Wikidata: https://lists.wikimedia.org/pipermail/wikidata/2020-February/013842.html
- Wikidata project chat: d:Wikidata:Project_chat#soweego_2_proposal
- Wikidata weekly summary: d:Wikidata:Status_updates/2020_02_24
- Wikidata Telegram channel: https://t.me/joinchat/AZriqUj5UagVMHXYzfZFvA
- Wiki research: https://lists.wikimedia.org/pipermail/wiki-research-l/2020-February/007127.html
- AI: https://lists.wikimedia.org/pipermail/ai/2020-February/000296.html
- Wikimedia: https://lists.wikimedia.org/pipermail/wikimedia-l/2020-February/094292.html
Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project by clicking the blue button in the infobox, or edit this section directly. (Other constructive feedback is welcome on the discussion page).
- soweego 1: Grants:Project/Hjfocs/soweego#Endorsements
- soweego 1.1: Grants:Project/Rapid/Hjfocs/soweego_1.1#Endorsements
- Very cool project idea. 2001:4CA0:0:F235:2486:8ED5:C3AA:C914 09:30, 27 January 2020 (UTC)
- Connections between DBs. Frettie (talk) 15:59, 12 February 2020 (UTC)
- Important update/sync mechanism for large third-party catalogs Magnus Manske (talk) 10:20, 17 February 2020 (UTC)
- Helps linking item around the world ;) Crazy1880 (talk) 17:16, 19 February 2020 (UTC)
- As Wikidata matures and becomes more important for knowledge graphs and the products (commercial or not) built upon them, it also becomes more important to keep its quality high. This project contributes to Wikidata's quality by detecting obsolete links between Wikidata and 3rd party databases, discovering new links between Wikidata and 3rd party databases, and helping synchronize facts between Wikidata and 3rd party databases. (Nicolastorzec (talk) 00:43, 20 February 2020 (UTC)
- Matching existing items is our biggest bottleneck in integrating new data sources. Any technological help in this regard is a good thing. The team also looks strong. 99of9 (talk) 01:35, 20 February 2020 (UTC)
- sounds reasonable and project site is well designed 22.214.171.124 05:42, 20 February 2020 (UTC)
- Link curation is hard and this will continue to make it easier and more efficient. StudiesWorld (talk) 11:55, 20 February 2020 (UTC)
- As we import more catalogs, we need more tools to improve the matching process. This will help. - PKM (talk) 21:44, 20 February 2020 (UTC)
- To complete previous work and provide more automated tools to data catalogs. Sabas88 (talk) 14:28, 21 February 2020 (UTC)
- This is useful, a benefit for Wikidata as well as the other catalogues, and makes it easy and fun to participate in the creation of knowledge records. Sebastian Wallroth (talk) 09:04, 22 February 2020 (UTC)
- I've always been a supporter of the project, and continue to be one! Sannita - not just another it.wiki sysop 17:52, 22 February 2020 (UTC)
- Any technology that helps linking item between DBs and makes easier to participate it's a good idea Tiputini (talk) 18:07, 23 February 2020 (UTC)
- I particularly like the sync-ing with third party databases. 126.96.36.199 15:22, 24 February 2020 (UTC)
- Essential for keeping Wikidata updated and usable for rapidly changing databases.-Nizil Shah (talk) 02:14, 26 February 2020 (UTC)
- This is potentially valuable. To get the most of it, though, the catalogs generated for Mix 'n' Match must be improved (better identifiers and descriptions to allow decisions without *always* having to click through). Some reflection is also due on why relatively little work has been done with MnM so far. Finally, I'd like to see a clear plan for maintenance and sustainability of the project beyond the project lead's current academic context. Will community members be able to keep running the tool? Ijon (talk) 13:07, 26 February 2020 (UTC)
- The proposal promises a good extension of the first proposal, and the first proposal has really shown it's worth. Identifiers and mappings are an important part of Wikidata, and will become crucial for quality efforts. This is an extremely helpful step to further enable a quality checking approach against external data sources. I agree with Ijon that a focus should be put on making it sustainable so that the project results can be kept active without the project ongoing in the future. denny (talk) 15:02, 26 February 2020 (UTC)
- As Magnus Manske. Epìdosis 15:28, 26 February 2020 (UTC)