Grants:Project/Hjfocs/soweego 2

This project is funded by a Project Grant

status: selected

soweego 2

summary: soweego is an artificial intelligence that links Wikidata to large external catalogs.
Now soweego wants to enable feedback loops between Wikidatans and catalog owners, through mutual data quality checks.
Unity is strength.

target: Wikidata

type of grant: tools and software, research

amount: 80k €

type of applicant: individual

grantee: Hjfocs

advisor: Magnus Manske

contact: fossati

spaziodati.eu

volunteer: .laramar., Back ache

created on: 15:03, 21 January 2020 (UTC)

Idea

soweego is a machine learning system that connects Wikidata to large-scale third-party catalogs. It takes a set of Wikidata items and a given catalog as input, and links them through record linkage^[1] techniques based on supervised learning.^[2] The main output is a dataset of Wikidata and third-party catalog identifier pairs.

Running example

Elvis Presley is Q303 in Wikidata and 01809552 in MusicBrainz: soweego links Q303 to 01809552 through a Wikidata identifier statement.

soweego's core task explained with the help of Elvis Presley (Q303).

The story so far

The project started with a 1-year Project Grant,^[3] which led to the release of version 1.^[4] A Rapid Grant^[5] followed for version 1.1. Outcomes look promising:

the soweego Wikidata bot uploaded hundreds of thousands of confident links;^[6]
medium-confident ones are in Mix'n'match (Q28054658) for curation;^[7]
there is strong community support;^[8]^[9]
outcomes fit the Wikidata development roadmap.^[10]

Growth directions

We see two principal growth directions:

the validator component;
addition of new third-party catalogs.

The former was outside the scope of the initial Project Grant. Nevertheless, we developed a prototype^[11] to address a key suggestion from the Wikimedia research team: "Develop a plan for syncing when external database and Wikidata diverge".^[12] The latter is a natural way to expand beyond the initial use case,^[13] thus increasing impact through more extensive coverage. Contributor engagement will be crucial for this task.

Why: the problem

soweego complements the "Increase data quality and trust" part of the Wikidata development roadmap.^[10] It aims to address three open challenges, listed from the most general to the most specific:

missing feedback loops between Wikidata data donors and re-users;^[14]^[15]
lack of methodical efforts to keep Wikidata in sync with third-party catalogs/databases;^[16]
under-usage of the statement ranking system,^[17] with few bots performing most of the edits.^[18]

These challenges are intertwined: synchronizing Wikidata with a given external database is a precondition for a feedback loop between the two communities. At the same time, sync results can affect how the ranking system is used.

How: the solution

We synchronize Wikidata to a given target catalog at a given point in time through a set of three validation criteria:

1. existence: whether a target identifier found in a given Wikidata item is still available in the target catalog;
2. links: to what extent all URLs available in a Wikidata item overlap with those in the corresponding target catalog entry;
3. metadata: to what extent relevant statements available in a Wikidata item overlap with those in the corresponding target catalog entry.

The application of these criteria to our running example translates into the following actions.^[19]

1. Elvis Presley (Q303) has a MusicBrainz identifier 01809552, which does not exist in MusicBrainz anymore.
   Action = mark the identifier statement with a deprecated rank;
2. Elvis Presley (Q303) has 7 URLs, MusicBrainz 01809552 has 8 URLs, and 3 overlap.
   Action = add 5 URLs from MusicBrainz to Elvis Presley (Q303) and submit 4 URLs from Wikidata to the MusicBrainz community;
3. Wikidata states that Elvis Presley (Q303) was born on January 8, 1935 in Tupelo, while MusicBrainz states that 01809552 was born in 1934 in Memphis.
   Action = add 2 referenced statements with MusicBrainz values to Elvis Presley (Q303) and notify the MusicBrainz community of the 2 Wikidata values.

In case of either full or no overlap in criteria 2 and 3, the Wikidata identifier statement should be marked with a preferred or a deprecated rank respectively. Note that community discussion is essential to refine these criteria.
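
As a rough illustration, the criteria can be read as simple set comparisons. The Python sketch below is not the soweego validator implementation: the function names, the rank labels written as plain strings, and the example.org URLs are assumptions made for illustration only, and "full overlap" is taken here to mean that both sides hold exactly the same set. It replays the fictitious running example.

```python
def check_existence(target_id, live_catalog_ids):
    """Criterion 1: suggest a deprecated rank when the identifier is gone."""
    return "normal" if target_id in live_catalog_ids else "deprecated"


def compare(wikidata_side, target_side):
    """Criteria 2 and 3: split two sets into shared and one-sided parts."""
    return {
        "shared": wikidata_side & target_side,
        "add_to_wikidata": target_side - wikidata_side,
        "submit_to_target": wikidata_side - target_side,
    }


def suggest_rank(comparison):
    """Full overlap -> preferred, no overlap -> deprecated, else unchanged."""
    one_sided = comparison["add_to_wikidata"] | comparison["submit_to_target"]
    if not comparison["shared"] and not one_sided:
        return "normal"  # nothing to compare on either side
    if not one_sided:
        return "preferred"
    if not comparison["shared"]:
        return "deprecated"
    return "normal"


# Criterion 1: the identifier is no longer in the (hypothetical) catalog dump.
existence_rank = check_existence("01809552", live_catalog_ids=set())

# Criterion 2: 7 URLs in Wikidata, 8 in MusicBrainz, 3 in common (made-up URLs).
wikidata_urls = {f"https://example.org/{i}" for i in range(7)}
musicbrainz_urls = {f"https://example.org/{i}" for i in range(4, 12)}
links = compare(wikidata_urls, musicbrainz_urls)

# Criterion 3: metadata as (property, value) pairs, with no pair in common.
wikidata_facts = {("date of birth", "1935-01-08"), ("place of birth", "Tupelo")}
musicbrainz_facts = {("date of birth", "1934"), ("place of birth", "Memphis")}
metadata = compare(wikidata_facts, musicbrainz_facts)

print(existence_rank)
print(len(links["add_to_wikidata"]), len(links["submit_to_target"]), suggest_rank(links))
print(len(metadata["add_to_wikidata"]), len(metadata["submit_to_target"]), suggest_rank(metadata))
```

Under these assumptions, the partial URL overlap leaves the rank unchanged, while the metadata comparison, having no pair in common, suggests a deprecated rank. How strictly values should be compared (for instance, dates with different precision) is exactly the kind of question left to community discussion.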

What: things done

Example of soweego catalogs uploaded to Mix'n'match (Q28054658)

soweego 1 has an experimental validator module^[20] that implements the aforementioned criteria. If you know how to use a command line and feel audacious, you can install soweego,^[21] import a target catalog,^[22] and try the validator out.^[23]

The soweego bot^[24] already has an approved task for criterion 2,^[25] together with a set of test edits.^[26] In addition, we performed (then reverted) a set of test edits for criterion 1.^[27]

Besides that, major contributions in terms of content addition are:

roughly 255,000^[28] confident identifier statements uploaded by the soweego bot, totalling 482,000^[6] Wikidata edits;
around 126,000^[28] medium-confident identifiers submitted to Mix'n'match (Q28054658) for curation.

Goals

G1: take the soweego validator component from experimental to stable;
G2: submit validation results to the target catalog providers;
G3: engage the Wikidata community via effective communication of soweego results;
G4: expand soweego coverage to additional target catalogs.

Impact

Output

O1: production-ready validator module, implementing criteria acknowledged by the community;
O2: datasets to enable feedback loops on the Wikidata users side, namely
- automatically ranked identifier dataset, as a result of validation criteria actions;
- entity enrichment statement dataset, based on available target data;
O3: datasets to enable feedback loops on the target catalog providers side, namely rotten URLs and additional URLs & values;
O4: engagement tools for Wikidata users, including visualization of soweego datasets and data curation tutorials;
O5: procedure to plug a new target catalog that minimizes programming efforts.

Outcomes

Feedback loop between target data donors and Wikidata users:
as a target catalog consumer, soweego can shift the target maintenance burden. Edits performed by the soweego bot (properly communicated through visualization for instance) close the loop from the Wikidata users side. A use case that emerged during the development of soweego 1 is the sanity check over target URLs: this yielded a rotten URLs dataset, which can be submitted to the target community;
checks against target catalogs, which entail the enrichment of Wikidata items based on available data, especially relationships among entries. Use cases encountered in soweego 1 target catalogs include:
- works by people, with an approved task^[29] and corresponding test edits;^[30]
- band membership of musicians;
automatic ranking of statements: a potentially huge impact on the Wikidata truthy statements dumps,^[31] which are both widely used as the "official" ones and underpin a vast portion of the Query Service.^[32]

soweego advertisement gone wrong at Wikimania 2019 :-D. Spotted by Sj (thanks!)

Local metrics

The following numerical metrics are projections over all target catalogs currently supported by soweego and all experimental validation criteria,^[33] based on estimates for a single catalog (i.e., Discogs (Q504063)) and a single criterion. Note that the actual amount of total statements will depend on data available in Wikidata and in target catalogs.

Validator datasets (O2):

250k ranked statements. Estimate: 21k identifier statements to be deprecated, as a result of criterion 1;
120k new statements. Estimate: 10k statements to be added or referenced, as a result of criterion 2;

Feedback loop datasets, target side (O3):

440k rotten URLs. Estimate: 110k URLs, as a result of the sanity check at import time;
128k extra values. Estimate: 16k values, as a result of criterion 3.
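
To make the relation between each projection and its underlying estimate explicit, the short Python sketch below divides one by the other. It only uses the figures listed above; the dictionary layout and variable names are illustrative.

```python
# (projection, single-catalog single-criterion estimate), as listed above
metrics = {
    "ranked statements": (250_000, 21_000),
    "new statements": (120_000, 10_000),
    "rotten URLs": (440_000, 110_000),
    "extra values": (128_000, 16_000),
}

# Scaling factor implied by each pair of figures
factors = {
    name: projection / estimate
    for name, (projection, estimate) in metrics.items()
}

for name, factor in factors.items():
    print(f"{name}: projection is about {factor:.1f} times the estimate")
```

The implied factors differ across datasets, since each projection covers a different combination of catalogs and checks.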

From a qualitative perspective, a project-specific request for comment^[34] will act as a survey to collect feedback.

Shared metrics

The 3 shared metrics^[35] are as follows.

50 total participants: sum of target catalog community fellows, Wikidata users facilitating the feedback loop, and contributors to the soweego system;
25 newly registered users: to be gathered from the Mix'n'match (Q28054658) user base;
370k content pages created or improved: sum of Wikidata statements edited by the soweego bot.

Plan

Work package

High-level milestones
ID	Title	Goal	Month	Effort
M1	Validator	G1	1-8	30%
M2	Feedback loop, data providers side	G2	3-12	25%
M3	Feedback loop, data users side	G3	3-12	25%
M4	New catalogs	G4	8-12	20%

Note that some milestones overlap in terms of timespan: this is required to exploit mutual interactions among them.

Milestones breakdown: lower-level stories
ID	Title	Action	Effort
M1.1	Validation criteria	Refine criteria through community discussion	35%
M1.2	Automatic ranking	Submit validation criteria actions on Wikidata statement ranks	20%
M1.3	Automatic enrichment	Leverage target catalog relationships to generate Wikidata statements	20%
M1.4	Interaction with constraints check	Intercept reports of this Wikibase extension to improve soweego results	25%
M2.1	Rotten URLs	Contribute rotten URLs to target catalog providers	30%
M2.2	Extra URLs & content	Submit additional URLs (criterion 2) and values (criterion 3) from Wikidata to target catalog providers	70%
M3.1	soweego bot dashboard	Improve communication of Wikidata edits made by soweego via data visualization	35%
M3.2	Data curation guidelines	Explain how to curate Wikidata and Mix'n'match contributions made by soweego	30%
M3.3	Real-world evaluation	Switch from in vitro to in situ evaluation of the soweego system	35%
M4.1	New domains	Extend support of target catalogs beyond people and works (initial use case)	20%
M4.2	Codebase generalization	Minimize domain-dependent logic	30%
M4.3	Simple import	Reduce the effort needed to add support for a new catalog	50%

Budget

The total amount requested is 80,318 €.

Budget breakdown
Item	Description	Commitment	Cost
Project lead	Responsible for the full project implementation	Full time (40 hrs/week), 12 PM^[36]	52,735 €
Core system architect	Technical operations head	Full time, 12 PM	12,253 €^[37]
Research assistant	In charge of the applied machine learning components	Part time (20 hrs/week), 6 PM	14,330 €^[38]
Dissemination	Expenses to attend 2 relevant community events	One shot	1,000 €
Total			80,318 €

Gross salaries are computed from average estimates based on roles and locations.^[39] The gross hourly labor rates are as follows.

project lead: 27.46 €;^[40]
core system architect: 25.52 €;^[41]
research assistant: 29.85 €.^[42]

Note that the project lead will serve as the main grantee and will appropriately allocate the funding.
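
As a quick sanity check, the budget lines can be summed and expressed as shares of the total. The Python sketch below only uses the costs from the breakdown table; the variable names are illustrative.

```python
# Costs in euros, copied from the budget breakdown table
budget = {
    "Project lead": 52_735,
    "Core system architect": 12_253,
    "Research assistant": 14_330,
    "Dissemination": 1_000,
}

total = sum(budget.values())
print(f"Total: {total:,} EUR")

# Share of the total requested amount per budget line
for item, cost in budget.items():
    print(f"{item}: {100 * cost / total:.1f}%")
```

The sum matches the total amount requested, 80,318 €.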

Community engagement

We identify the following relevant communities, which are both inside and outside the Wikimedia landscape.

Wikidata development team;
Mix'n'match (Q28054658) users;
Target catalog owners:
- Discogs (Q504063) team;^[43]
- MusicBrainz (Q14005) team;^[44]
- Internet Movie Database (Q37312) team.^[45]

Besides the currently supported target catalogs, we would like to engage Royal Botanic Gardens, Kew (Q18748726) through a formal collaboration with the Intelligent Data Analysis team.^[46] The organization maintains a set of biodiversity catalogs, such as Index Fungorum (Q1860469). The team is exploring directions to leverage and disseminate their collections.^[47] Moreover, this would let soweego scale up to a totally different domain. Hence, we believe the collaboration would yield a 3-fold benefit: for Wikidata, Kew, and soweego.

COVID-19 planning

We acknowledge receipt of the e-mail sent by the Project Grant program officers regarding the updated WMF requirements for the current COVID-19 health emergency. Below we list the guidelines and explain how this proposal fully complies with them:

travel and/or offline events are a minor focus of this proposal

The #Work package does not include any offline events. On the other hand, a #Budget line covers participation in 2 community events. This line does not mention specific offline events: the project lead is responsible for their selection.

we can complete the core components of the proposed work plan without offline events or travel

Absolutely, 100% of the #Work package is dedicated to the technical development of the project. All the tasks can be carried out in an online/remote setting.

we are able to postpone any planned offline events or travel until WMF guidelines allow for them, without significant harm to the goals of this project

As mentioned above, there are no specific offline events planned. The project lead is responsible for choosing relevant ones: this choice will be postponed until WMF guidelines allow for it.

how this project would be impacted if travel and offline events prove unfeasible throughout the entire life of this project

This would not be a problem, since we are able to fully convert the dissemination of the project output into online forms. For instance, this would translate into allocating more effort to activities M3.1 and M3.2 (see #Work package).

References

[1] en:Record_linkage
[2] en:Supervised_learning
[3] Grants:Project/Hjfocs/soweego
[4] https://soweego.readthedocs.io/
[5] Grants:Project/Rapid/Hjfocs/soweego_1.1
[6] https://xtools.wmflabs.org/ec/wikidata.org/Soweego%20bot
[7] see for instance https://tools.wmflabs.org/mix-n-match/#/group/Music
[8] Grants:Project/Rapid/Hjfocs/soweego_1.1#Endorsements
[9] https://github.com/Wikidata/soweego/stargazers
[10] https://eu.roadmunk.com/publish/f247f2e06ee338c9997893bd1f9a696fbf2a40ed
[11] see more in #What:_things_done
[12] see third main bullet point in Grants_talk:Project/Hjfocs/soweego#Round_2_2017_decision
[13] 4 catalogs, see Grants:Project/Hjfocs/soweego/Timeline#August_2018:_big_fishes_selection
[14] see the feedback loops with data re-users block in https://eu.roadmunk.com/publish/f247f2e06ee338c9997893bd1f9a696fbf2a40ed
[15] phab:T234976
[16] see the checks against 3rd party databases block in https://eu.roadmunk.com/publish/f247f2e06ee338c9997893bd1f9a696fbf2a40ed
[17] https://grafana.wikimedia.org/d/000000175/wikidata-datamodel-statements?orgId=1&from=1472503010563&to=1579733999000&panelId=8&fullscreen
[18] see for instance d:Special:Contributions/PreferentialBot
[19] the examples are fictitious and do not reflect the actual data
[20] https://soweego.readthedocs.io/en/latest/validator.html
[21] https://soweego.readthedocs.io/en/latest/index.html#get-ready
[22] https://soweego.readthedocs.io/en/latest/cli.html#importer
[23] https://soweego.readthedocs.io/en/latest/cli.html#validator-aka-sync
[24] d:User:Soweego_bot
[25] d:Wikidata:Requests_for_permissions/Bot/Soweego_bot_2
[26] https://www.wikidata.org/w/index.php?title=Special:Contributions&target=Soweego_bot&contribs=user&start=2018-11-05&end=2018-11-05&limit=250
[27] https://www.wikidata.org/w/index.php?title=Special:Contributions&target=Soweego_bot&contribs=user&start=2018-11-07&end=2018-11-13&limit=100
[28] Grants:Project/Hjfocs/soweego/Final#Summary
[29] d:Wikidata:Requests_for_permissions/Bot/Soweego_bot_3
[30] https://www.wikidata.org/wiki/Special:Contributions?offset=20190717122745&limit=100&contribs=user&target=Soweego+bot
[31] mw:Wikibase/Indexing/RDF_Dump_Format#Truthy_statements
[32] plenty of queries use the truthy prefix wdt, see for instance d:Wikidata:SPARQL_query_service/queries/examples#Showcase_Queries
[33] see #How:_the_solution
[34] d:Wikidata:Requests_for_comment
[35] m:Grants:Metrics#Three_shared_metrics
[36] person months
[37] corresponds to 25% of the salary. The rest is funded by the hosting university
[38] corresponds to 50% of the salary. The rest is funded by the hosting university
[39] see #Participants
[40] https://www.glassdoor.com/Salaries/milan-senior-project-manager-salary-SRCH_IL.0,5_IM1058_KO6,28.htm
[41] https://www.glassdoor.com/Salaries/milan-software-architect-salary-SRCH_IL.0,5_IM1058_KO6,24.htm
[42] https://www.glassdoor.com/Salaries/munich-research-assistant-salary-SRCH_IL.0,6_IM1053_KO7,25.htm
[43] https://www.discogs.com/team
[44] https://metabrainz.org/team
[45] https://getsatisfaction.com/imdb/details/employees
[46] https://www.kew.org/science/our-science/departments/biodiversity-and-spatial-analysis/intelligent-data-analysis
[47] https://www.kew.org/sites/default/files/2019-11/Kew%20Future%20Leaders%20Fellowship.pdf#%5B%7B%22num%22%3A102%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22FitV%22%7D%2C-54%5D

Get involved

Participants

Marco Fossati

Hjfocs is a research scientist with a double background in natural languages and information technology. He holds a PhD in computer science from the University of Trento (Q930528).

His profile is highly hybrid and can be defined as an enthusiastic leader of applied research in natural language processing for the Web, backed by a strong dissemination attitude and an innate passion for open knowledge, all blended with software engineering skills and a deep affection for human languages.

He is currently focusing on Wikidata data quality and has been leading soweego since the very first prototype, as well as the StrepHit project, both funded by the Wikimedia Foundation.

Emilio Dorigatti

Edorigatti is a PhD candidate at the Ludwig Maximilian University of Munich (Q55044), where he is applying machine learning methods to the design of personalized vaccines for HIV and cancer, in collaboration with the Helmholtz Zentrum München (Q878592).

He holds a BSc in computer science and a double MSc in data science, with a minor in innovation and entrepreneurship. He also has several years of working experience as a software developer, data engineer, and data scientist. He is mostly interested in Bayesian inference, uncertainty quantification, and Bayesian Deep Learning.

Emilio was a core team member of the StrepHit project.

Massimo Frasson

MaxFrax96 has been a professional software engineer for more than 5 years. He has mainly worked as an iOS developer and game developer at Belka.

He has always had a strong interest in software architecture and algorithm performance. He is currently pursuing an MSc in computer science at the University of Milan (Q46210) to deepen his knowledge of data science.

Massimo has been involved in soweego since the early stages, and has made key contributions to its development.

Advisor Volunteering my experience from Mix'n'match, amongst others Magnus Manske (talk) 10:47, 17 February 2020 (UTC)
Volunteer I'm part of OpenMLOL, which has a a property in Wikidata (https://www.wikidata.org/wiki/Property:P3762) as the identifier of an author in openMLOL. I'm trying to standardize our author ID with the Wikidata ID. .laramar. (talk) 16:18, 20 February 2020 (UTC)
Volunteer Volunteer Back ache (talk) 10:29, 2 March 2020 (UTC)

Community notification

The links below reference notifications to relevant mailing lists and Wiki pages. They are sorted in descending order of specificity.

Wikidata: https://lists.wikimedia.org/pipermail/wikidata/2020-February/013842.html
Wikidata project chat: d:Wikidata:Project_chat#soweego_2_proposal
Wikidata weekly summary: d:Wikidata:Status_updates/2020_02_24
Wikidata Telegram channel: https://t.me/joinchat/AZriqUj5UagVMHXYzfZFvA
Wiki research: https://lists.wikimedia.org/pipermail/wiki-research-l/2020-February/007127.html
AI: https://lists.wikimedia.org/pipermail/ai/2020-February/000296.html
Wikimedia: https://lists.wikimedia.org/pipermail/wikimedia-l/2020-February/094292.html

Endorsements

This section is for endorsements only. Please post your structured feedback on the discussion page. Thanks!

Past endorsements

soweego 1: Grants:Project/Hjfocs/soweego#Endorsements
soweego 1.1: Grants:Project/Rapid/Hjfocs/soweego_1.1#Endorsements

GitHub stargazers

See https://github.com/Wikidata/soweego/stargazers.

Current endorsements

Very cool project idea. 2001:4CA0:0:F235:2486:8ED5:C3AA:C914 09:30, 27 January 2020 (UTC)
Connections between DBs. Frettie (talk) 15:59, 12 February 2020 (UTC)
Important update/sync mechanism for large third-party catalogs Magnus Manske (talk) 10:20, 17 February 2020 (UTC)
Helps linking item around the world ;) Crazy1880 (talk) 17:16, 19 February 2020 (UTC)
As Wikidata matures and becomes more important for knowledge graphs and the products (commercial or not) built upon them, it also becomes more important to keep its quality high. This project contributes to Wikidata's quality by detecting obsolete links between Wikidata and 3rd party databases, discovering new links between Wikidata and 3rd party databases, and helping synchronize facts between Wikidata and 3rd party databases. (Nicolastorzec (talk) 00:43, 20 February 2020 (UTC)
Matching existing items is our biggest bottleneck in integrating new data sources. Any technological help in this regard is a good thing. The team also looks strong. 99of9 (talk) 01:35, 20 February 2020 (UTC)
sounds reasonable and project site is well designed 193.154.94.42 05:42, 20 February 2020 (UTC)
Link curation is hard and this will continue to make it easier and more efficient. StudiesWorld (talk) 11:55, 20 February 2020 (UTC)
As we import more catalogs, we need more tools to improve the matching process. This will help. - PKM (talk) 21:44, 20 February 2020 (UTC)
To complete previous work and provide more automated tools to data catalogs. Sabas88 (talk) 14:28, 21 February 2020 (UTC)
This is useful, a benefit for Wikidata as well as the other catalogues, and makes it easy and fun to participate in the creation of knowledge records. Sebastian Wallroth (talk) 09:04, 22 February 2020 (UTC)
I've always been a supporter of the project, and continue to be one! Sannita - not just another it.wiki sysop 17:52, 22 February 2020 (UTC)

Any technology that helps linking item between DBs and makes easier to participate it's a good idea Tiputini (talk) 18:07, 23 February 2020 (UTC)
I particularly like the sync-ing with third party databases. 12.39.25.130 15:22, 24 February 2020 (UTC)
Essential for keeping Wikidata updated and usable for rapidly changing databases.-Nizil Shah (talk) 02:14, 26 February 2020 (UTC)
This is potentially valuable. To get the most of it, though, the catalogs generated for Mix 'n' Match must be improved (better identifiers and descriptions to allow decisions without *always* having to click through). Some reflection is also due on why relatively little work has been done with MnM so far. Finally, I'd like to see a clear plan for maintenance and sustainability of the project beyond the project lead's current academic context. Will community members be able to keep running the tool? Ijon (talk) 13:07, 26 February 2020 (UTC)
The proposal promises a good extension of the first proposal, and the first proposal has really shown it's worth. Identifiers and mappings are an important part of Wikidata, and will become crucial for quality efforts. This is an extremely helpful step to further enable a quality checking approach against external data sources. I agree with Ijon that a focus should be put on making it sustainable so that the project results can be kept active without the project ongoing in the future. denny (talk) 15:02, 26 February 2020 (UTC)
As Magnus Manske. Epìdosis 15:28, 26 February 2020 (UTC)
I'd love to see it action. As a regular MnM user, I believe feedback loop and Wikidata content validation are essential and will also be very useful for data donors.--HakanIST (talk) 19:09, 28 February 2020 (UTC)
I would like to see Wikidata landscape linked to other external catalogs and also learn how this process can be further improved. John Samuel 15:13, 29 February 2020 (UTC)
Looks certainly promising. ESM (talk) 06:27, 2 March 2020 (UTC)
It would save many human/volunteer time. Maybe at the beginning will need some human review, I would like to hear how this revisions will be. Please take on account data repositories outside USA and Europe, and non-English-based alphabets, don't get the data biased. Salvador (talk) 05:10, 3 March 2020 (UTC)
Mix'n'Match is a great tool, but needs good input. The ChEBI database is a good example where name matching does not work, and the matching should happen on other properties of the entities in ChEBI being matched to Wikidata (particularly: the InChIKey). If soweego can fill that gap, that would be awesome. Egon Willighagen (talk) 11:29, 4 March 2020 (UTC)
Support This would be a critical addition to the mission and work of Wikidata! Great work so far! Todrobbins (talk) 22:33, 4 March 2020 (UTC)
Its necessary for Wikidata; we need a tool to quickly and easy add statements from sourced databases. Matlin (talk) 10:05, 5 March 2020 (UTC)
Support Important tool for establishing Wikidata as a hub for different databases. --Nw520 (talk) 13:29, 5 March 2020 (UTC)
Support Sounds like a good way of handling large datasets. Richard Nevell (WMUK) (talk) 14:46, 5 March 2020 (UTC)
Support this looks like a great development for GLAM-wiki projecs too! Marta Arosio (WMIT) (talk) 07:27, 9 March 2020 (UTC)
third party databases are super important for wikidata! Icebob99 (talk) 20:58, 11 March 2020 (UTC)
Support Obvious win, promotes w:WP:V and w:MOS:BUILD. Feeding high-confidence linkages back to donor databases will both nourish them and help to highlight any issues of falsely high confidence. LeadSongDog (talk) 15:45, 12 March 2020 (UTC)
connecting data sources is super useful Alexzabbey (talk) 15:21, 13 March 2020 (UTC)
Linking identifiers is arguably the most important role of Wikidata, and soweego can do it more efficiently than any human. Vahurzpu (talk) 17:17, 13 March 2020 (UTC)
Support MargaretRDonald (talk) 15:15, 15 March 2020 (UTC)
This could be a good way to improve Wikidata on a large scale. Also, academic libraries are interested in this sort of thing and could help with data-gathering. Rachel Helps (BYU) (talk) 20:55, 16 March 2020 (UTC)
The work on soweego has been very useful so far and I'd love to see the work on it continue. I really appreciate that it aligns well with our roadmap and the current priorities for Wikidata and the focus on increasing the quality of its data. I have confidence in the people behind this proposal to deliver on their promises. --Lydia Pintscher (WMDE) (talk) 13:30, 17 March 2020 (UTC)
Reduced human maintenance effort Frenzie (talk) 20:08, 20 March 2020 (UTC)
Would help make processes more efficient, and reduce human labour! Support! TheFrog001 (talk) 14:01, 22 March 2020 (UTC)
Support Interesting project Afernand74 (talk) 21:40, 25 March 2020 (UTC)
Very important for Wikidata and Wikimedia ecosystem in general. Tubezlob (talk) 10:03, 29 March 2020 (UTC)
Support I love using Mix'n'Match, I teach it during my Wikidata workshops and this will make it even better! Powerek38 (talk) 05:21, 1 April 2020 (UTC)
Support I am satisfied with the in progress Grants:Project/Hjfocs/soweego and support to keep this going in this proposal. Blue Rasberry (talk) 12:48, 1 April 2020 (UTC)
The distinguishing of the Wiki entities and their interconnection to other databases is an important thing! I'm glad someone's taking care of it. Kommerz (talk) 09:57, 3 April 2020 (UTC)
Support A useful project, as connecting with other databases is really important. --Marcok (talk) 15:19, 3 April 2020 (UTC)
Support Wikidata needs that kind of tools to improve more and more it's contents. Bye, Elisardojm (talk) 07:52, 4 April 2020 (UTC)
This would very much help the cooperation between catalog holders and the open source data community. Beireke1 (talk) 14:25, 6 April 2020 (UTC)
Support Great initiative to facilitate cooperation between third-party catalogs/databases and Wikidata community. Sam.Donvil (talk) 14:42, 6 April 2020 (UTC)
Support Huge support for Soweego 2 from me! The proposal addresses the most difficult and time consuming issues that come up when trying to import external datasets and keep them in sync with Wikidata. This will definitely increase Wikidata quality and connectivity with the rest of the web, while saving countless hours of repetitive maintenance work for community members. NavinoEvans (talk) 12:22, 8 April 2020 (UTC)
Similar to Freebase Review queue and would like to see this developed again somehow for Wikidata. Soweego 2 is a good beginning effort for allowing community review of potential mass uploads through external tools such as OpenRefine, etc. Perhaps a dedicated Tag in the queue could be added to Soweego 2 for OpenRefine to be used for the uploads, so we know that uploads came from OpenRefine tool users. Thadguidry (talk) 21:04, 8 April 2020 (UTC)
Support: sounds useful. Nomen ad hoc (talk) 07:27, 9 April 2020 (UTC).
Support: sounds like a good tool. 2800:A4:3169:A000:A065:E1A:54B:9A9A 19:34, 9 April 2020 (UTC)
Support: Strong support. Looks like a cool project. Ranjithsiji (talk) 14:29, 10 April 2020 (UTC)
This project sounds like a logic next step, first start with addiding data (semi-automatically e.g. Mix'n'match, second step start an automatic feedback loop through AI/machine learning on data extracted from multiple source to keep on improving Wikidata and other sources. Setting up such an feedback sycle and making the software open source, will be valuable for many projects that can use it as a template example or join the data web by linking their own data sources. NAZondervan (talk) 08:57, 13 April 2020 (UTC)
Support: Olea (talk) 17:14, 15 April 2020 (UTC)
Support: Like tears in rain (talk) 16:03, 16 April 2020 (UTC)
Support: Important for keeping and improving the data in Wikidata and keep said data synchronized with outside databases. Tm (talk) 13:47, 18 April 2020 (UTC)
Support:We need to improve the reliabilty of Wikidata items also with automated tools that work like human users, so this tool could be very useful for this purpose. Mess (talk) 08:51, 25 April 2020 (UTC)
Support: It's an important project, as it tackles the issue of synchronizing Wikidata with partner databases, such as MusicBrainz. Beat Estermann (talk) 17:08, 1 May 2020 (UTC)
Mapping WikiData QIDs to third party open data sources such as Musicbrainz is a no brainer. It's beneficial to the entire open web infrastructure and useful in countless ways. Audiodude (talk) 02:21, 2 December 2020 (UTC)
I love it, It's really useful Mr. Ibrahem (talk) 18:47, 1 February 2022 (UTC)
