Grants:Project/Hjfocs/soweego 2

status: proposed
soweego 2
Crosscut.jpg
summary: soweego is an artificial intelligence that links Wikidata to large external catalogs.
Now soweego wants to enable feedback loops between Wikidatans and catalog owners, through mutual data quality checks.
Unity is strength.
target: Wikidata
type of grant: tools and software, research
amount: 80k €
type of applicant: individual
grantee: Hjfocs
advisor: Magnus Manske
contact: fossati(_AT_)spaziodati.eu
volunteer: .laramar.
created on: 15:03, 21 January 2020 (UTC)


Idea

soweego is a machine learning system that connects Wikidata to large-scale third-party catalogs. It takes a set of Wikidata items and a given catalog as input, and links them through record linkage[1] techniques based on supervised learning.[2] The main output is a dataset of Wikidata and third-party catalog identifier pairs.
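The linking step can be pictured with a toy similarity-based matcher. This is an illustrative sketch, not soweego's actual code: the function names, the Jaccard feature, the threshold, and the second candidate identifier are all invented here. soweego proper replaces the hand-set rule with a supervised classifier trained on known Wikidata-catalog pairs.

```python
# Toy record-linkage matcher: compare a Wikidata item against candidate
# catalog entries with a simple name-similarity feature, then keep the
# best-scoring pair above a threshold. soweego replaces this hand-set
# rule with a supervised classifier trained on known matches.

def name_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercased word sets (a toy feature)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def link(wikidata_item: dict, candidates: list, threshold: float = 0.5):
    """Return the (catalog_id, score) of the best match, or None."""
    scored = [
        (c["id"], name_similarity(wikidata_item["label"], c["name"]))
        for c in candidates
    ]
    best = max(scored, key=lambda pair: pair[1], default=None)
    return best if best and best[1] >= threshold else None

item = {"qid": "Q303", "label": "Elvis Presley"}
candidates = [
    {"id": "01809552", "name": "Elvis Presley"},
    {"id": "99999999", "name": "Elvis Costello"},  # made-up identifier
]
print(link(item, candidates))  # → ('01809552', 1.0)
```

The output of this step, scaled up, is exactly the dataset of Wikidata and third-party identifier pairs mentioned above.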

Running example

Elvis Presley is Q303 in Wikidata and 01809552 in MusicBrainz: soweego links Q303 to 01809552 through a Wikidata identifier statement.

 
soweego's core task explained with the help of Elvis Presley (Q303).
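For reference, the edit in this example boils down to a single identifier statement. Expressed in QuickStatements V1 syntax (tab-separated fields; P434 is the MusicBrainz artist ID property), it would read:

```text
Q303	P434	"01809552"
```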

The story so far

The project started with a 1-year Project Grant,[3] which led to the release of version 1.[4] A Rapid Grant[5] followed for version 1.1. Outcomes look promising:

  • the soweego Wikidata bot uploaded hundreds of thousands of confident links;[6]
  • medium-confident ones are in Mix'n'match (Q28054658) for curation;[7]
  • there is strong community support;[8][9]
  • outcomes fit the Wikidata development roadmap.[10]

Growth directions

We see two principal growth directions:

  1. the validator component;
  2. addition of new third-party catalogs.

The former was out of the initial Project Grant's scope. Nevertheless, we developed a prototype[11] to address a key suggestion from the Wikimedia research team: "Develop a plan for syncing when external database and Wikidata diverge".[12] The latter is a natural way to expand beyond the initial use case,[13] thus increasing impact through more extensive coverage. Contributor engagement will be crucial for this task.

Why: the problem

soweego complements the Wikidata development roadmap with respect to its Increase data quality and trust part.[10] It aims to address three open challenges, ordered from high-level to specific:

  1. missing feedback loops between Wikidata data donors and re-users;[14][15]
  2. lack of methodical efforts to keep Wikidata in sync with third-party catalogs/databases;[16]
  3. under-usage of the statement ranking system,[17] with few bots performing most of the edits.[18]

These challenges are intertwined: synchronizing Wikidata with a given external database is a precondition for enabling a feedback loop between both communities. At the same time, sync results can affect how the ranking system is used.

How: the solution

We synchronize Wikidata to a given target catalog at a given point in time through a set of validation criteria:

  1. existence: whether a target identifier found in a given Wikidata item is still available in the target catalog;
  2. links: to what extent all URLs available in a Wikidata item overlap with those in the corresponding target catalog entry;
  3. metadata: to what extent relevant statements available in a Wikidata item overlap with those in the corresponding target catalog entry.

The application of these criteria to our running example translates into the following actions.[19]

  1. Elvis Presley (Q303) has a MusicBrainz identifier 01809552, which does not exist in MusicBrainz anymore.
    Action = mark the identifier statement with a deprecated rank;
  2. Elvis Presley (Q303) has 7 URLs, MusicBrainz 01809552 has 8 URLs, and 3 overlap.
    Action = add 5 URLs from MusicBrainz to Elvis Presley (Q303) and submit 4 URLs from Wikidata to the MusicBrainz community;
  3. Wikidata states that Elvis Presley (Q303) was born on January 8, 1935 in Tupelo, while MusicBrainz states that 01809552 was born in 1934 in Memphis.
    Action = add 2 referenced statements with MusicBrainz values to Elvis Presley (Q303) and notify 2 Wikidata values to the MusicBrainz community.

In case of full or no overlap in criteria 2 and 3, the Wikidata identifier statement should be marked with a preferred or a deprecated rank, respectively. Note that community discussion is essential to refine these criteria.
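The criteria and the overlap rule of thumb above can be sketched in a few lines. This is an illustrative implementation, not soweego's validator API: the function and action names are invented here. The final lines replay the URL arithmetic of the running example with placeholder URLs.

```python
# Illustrative implementation of the three validation criteria (function
# names are invented, not soweego's actual API). Each check compares the
# Wikidata side with the matched target catalog entry.

def check_existence(target_ids, identifier):
    """Criterion 1: deprecate the identifier statement if it vanished."""
    return "keep" if identifier in target_ids else "deprecate"

def check_links(wd_urls, target_urls):
    """Criterion 2: exchange the non-overlapping URLs between the sides."""
    return {
        "add_to_wikidata": target_urls - wd_urls,   # missing in Wikidata
        "submit_to_target": wd_urls - target_urls,  # missing in the target
    }

def rank_from_overlap(shared, total):
    """Full overlap -> preferred rank; no overlap -> deprecated; else None."""
    if total == 0:
        return None
    if shared == total:
        return "preferred"
    if shared == 0:
        return "deprecated"
    return None

# Running example: 7 Wikidata URLs, 8 MusicBrainz URLs, 3 shared.
wd = {f"wd{i}" for i in range(4)} | {"s1", "s2", "s3"}
mb = {f"mb{i}" for i in range(5)} | {"s1", "s2", "s3"}
actions = check_links(wd, mb)
print(len(actions["add_to_wikidata"]), len(actions["submit_to_target"]))  # → 5 4
```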

What: things done

 
Example of soweego catalogs uploaded to Mix'n'match (Q28054658)

soweego 1 has an experimental validator module[20] that implements the aforementioned criteria. If you know how to use a command line and feel audacious, you can install soweego,[21] import a target catalog,[22] and try the validator out.[23]

The soweego bot[24] already has an approved task for criterion 2,[25] together with a set of test edits.[26] In addition, we performed (then reverted) a set of test edits for criterion 1.[27]

Besides that, major contributions in terms of content addition are:

  • roughly 255,000[28] confident identifier statements uploaded by the soweego bot, totalling 482,000[6] Wikidata edits;
  • around 126,000[28] medium-confident identifiers submitted to Mix'n'match (Q28054658) for curation.

Goals

  • G1: take the soweego validator component from experimental to stable;
  • G2: submit validation results to the target catalog providers;
  • G3: engage the Wikidata community via effective communication of soweego results;
  • G4: expand soweego coverage to additional target catalogs.

Impact

Output

  • O1: production-ready validator module, implementing criteria acknowledged by the community;
  • O2: datasets to enable feedback loops on the Wikidata users side, namely
    • automatically ranked identifier dataset, as a result of validation criteria actions;
    • entity enrichment statement dataset, based on available target data;
  • O3: datasets to enable feedback loops on the target catalog providers side, namely rotten URLs and additional URLs & values;
  • O4: engagement tools for Wikidata users, including visualization of soweego datasets and data curation tutorials;
  • O5: procedure to plug a new target catalog that minimizes programming efforts.

Outcomes

  1. Feedback loop between target data donors and Wikidata users:
    as a target catalog consumer, soweego can take on part of the target's maintenance burden. Edits performed by the soweego bot (properly communicated, for instance through visualization) close the loop on the Wikidata users' side. A use case that emerged during the development of soweego 1 is the sanity check over target URLs: this yielded a rotten URLs dataset, which can be submitted to the target community;
  2. checks against target catalogs, which entail the enrichment of Wikidata items with available data, especially relationships among entries. Use cases encountered in soweego 1 target catalogs include:
    • works by people, with an approved task[29] and corresponding test edits;[30]
    • band membership of musicians;
  3. automatic ranking of statements: a potentially huge impact on the Wikidata truthy statements dumps,[31] which are both widely used as the "official" ones and underpin a vast portion of the Query Service.[32]
 
soweego advertisement gone wrong at Wikimania 2019 :-D. Spotted by Sj (thanks!)

Local metrics

The following numerical metrics are projections over all target catalogs currently supported by soweego and all experimental validation criteria,[33] based on estimates for a single catalog (i.e., Discogs (Q504063)) and a single criterion. Note that the actual number of statements will depend on the data available in Wikidata and in the target catalogs.

  • validator datasets (O2):
    1. 250k ranked statements. Estimate: 21k identifier statements to be deprecated, as a result of criterion 1;
    2. 120k new statements. Estimate: 10k statements to be added or referenced, as a result of criterion 2;
  • feedback loop datasets, target side (O3):
    1. 440k rotten URLs. Estimate: 110k URLs, as a result of the sanity check at import time;
    2. 128k extra values. Estimate: 16k values, as a result of criterion 3.

From a qualitative perspective, a project-specific request for comment[34] will act as a survey to collect feedback.

Shared metrics

The 3 shared metrics[35] are as follows:

  1. 50 total participants: sum of target catalog community fellows, Wikidata users facilitating the feedback loop, and contributors to the soweego system;
  2. 25 newly registered users: to be gathered from the Mix'n'match (Q28054658) user base;
  3. 370k content pages created or improved: sum of Wikidata statements edited by the soweego bot.

Plan

Work package

High-level milestones
ID  Title                               Goal  Month  Effort
M1  Validator                           G1    1-8    30%
M2  Feedback loop, data providers side  G2    3-12   25%
M3  Feedback loop, data users side      G3    3-12   25%
M4  New catalogs                        G4    8-12   20%

Note that some milestones overlap in terms of timespan: this is required to exploit mutual interactions among them.

Milestones breakdown: lower-level stories
ID    Title                               Action                                                                                                    Effort
M1.1  Validation criteria                 Refine criteria through community discussion                                                              35%
M1.2  Automatic ranking                   Submit validation criteria actions on Wikidata statement ranks                                            20%
M1.3  Automatic enrichment                Leverage target catalog relationships to generate Wikidata statements                                     20%
M1.4  Interaction with constraints check  Intercept reports of this Wikibase extension to improve soweego results                                   25%
M2.1  Rotten URLs                         Contribute rotten URLs to target catalog providers                                                        30%
M2.2  Extra URLs & content                Submit additional URLs (criterion 2) and values (criterion 3) from Wikidata to target catalog providers   70%
M3.1  soweego bot dashboard               Improve communication of Wikidata edits made by soweego via data visualization                            35%
M3.2  Data curation guidelines            Explain how to curate Wikidata and Mix'n'match contributions made by soweego                              30%
M3.3  Real-world evaluation               Switch from in vitro to in situ evaluation of the soweego system                                          35%
M4.1  New domains                         Extend support of target catalogs beyond people and works (initial use case)                              20%
M4.2  Codebase generalization             Minimize domain-dependent logic                                                                           30%
M4.3  Simple import                       Squeeze the effort needed to add support for a new catalog                                                50%

Budget

The total amount requested is 80,318 €.

Budget breakdown
Item                   Description                                           Commitment                          Cost
Project lead           Responsible for the full project implementation       Full time (40 hrs/week), 12 PM[36]  52,735 €
Core system architect  Technical operations head                             Full time, 12 PM                    12,253 €[37]
Research assistant     In charge of the applied machine learning components  Part time (20 hrs/week), 6 PM       14,330 €[38]
Dissemination          Expenses to attend 2 relevant community events        One shot                            1,000 €
Total                                                                                                            80,318 €

Gross salaries of the human resources are computed from average estimates based on roles and locations.[39] Gross labor rates per hour follow:

  • project lead: 27.46 €;[40]
  • core system architect: 25.52 €;[41]
  • research assistant: 29.85 €.[42]

Note that the project lead will serve as the main grantee and will allocate the funding appropriately.

Community engagement

We identify the following relevant communities, both inside and outside the Wikimedia landscape.

Besides the currently supported target catalogs, we would like to engage Royal Botanic Gardens, Kew (Q18748726) through a formal collaboration with its Intelligent Data Analysis team.[46] The organization maintains a set of biodiversity catalogs, such as Index Fungorum (Q1860469), and the team is exploring directions to leverage and disseminate their collections.[47] Moreover, this would let soweego scale up to an entirely different domain. We therefore believe the collaboration would yield a 3-fold benefit: for Wikidata, for Kew, and for soweego.

References

    1. en:Record_linkage
    2. en:Supervised_learning
    3. Grants:Project/Hjfocs/soweego
    4. https://soweego.readthedocs.io/
    5. Grants:Project/Rapid/Hjfocs/soweego_1.1
    6. a b https://xtools.wmflabs.org/ec/wikidata.org/Soweego%20bot
    7. see for instance https://tools.wmflabs.org/mix-n-match/#/group/Music
    8. Grants:Project/Rapid/Hjfocs/soweego_1.1#Endorsements
    9. https://github.com/Wikidata/soweego/stargazers
    10. a b https://eu.roadmunk.com/publish/f247f2e06ee338c9997893bd1f9a696fbf2a40ed
    11. see more in #What:_things_done
    12. see third main bullet point in Grants_talk:Project/Hjfocs/soweego#Round_2_2017_decision
    13. 4 catalogs, see Grants:Project/Hjfocs/soweego/Timeline#August_2018:_big_fishes_selection
    14. see the feedback loops with data re-users block in https://eu.roadmunk.com/publish/f247f2e06ee338c9997893bd1f9a696fbf2a40ed
    15. phab:T234976
    16. see the checks against 3rd party databases block in https://eu.roadmunk.com/publish/f247f2e06ee338c9997893bd1f9a696fbf2a40ed
    17. https://grafana.wikimedia.org/d/000000175/wikidata-datamodel-statements?orgId=1&from=1472503010563&to=1579733999000&panelId=8&fullscreen
    18. see for instance d:Special:Contributions/PreferentialBot
    19. the examples are fictitious and do not reflect the actual data
    20. https://soweego.readthedocs.io/en/latest/validator.html
    21. https://soweego.readthedocs.io/en/latest/index.html#get-ready
    22. https://soweego.readthedocs.io/en/latest/cli.html#importer
    23. https://soweego.readthedocs.io/en/latest/cli.html#validator-aka-sync
    24. d:User:Soweego_bot
    25. d:Wikidata:Requests_for_permissions/Bot/Soweego_bot_2
    26. https://www.wikidata.org/w/index.php?title=Special:Contributions&target=Soweego_bot&contribs=user&start=2018-11-05&end=2018-11-05&limit=250
    27. https://www.wikidata.org/w/index.php?title=Special:Contributions&target=Soweego_bot&contribs=user&start=2018-11-07&end=2018-11-13&limit=100
    28. a b Grants:Project/Hjfocs/soweego/Final#Summary
    29. d:Wikidata:Requests_for_permissions/Bot/Soweego_bot_3
    30. https://www.wikidata.org/wiki/Special:Contributions?offset=20190717122745&limit=100&contribs=user&target=Soweego+bot
    31. mw:Wikibase/Indexing/RDF_Dump_Format#Truthy_statements
    32. plenty of queries use the truthy prefix wdt, see for instance d:Wikidata:SPARQL_query_service/queries/examples#Showcase_Queries
    33. see #How:_the_solution
    34. d:Wikidata:Requests_for_comment
    35. m:Grants:Metrics#Three_shared_metrics
    36. person months
    37. corresponds to 25% of the salary. The rest is funded by the hosting university
    38. corresponds to 50% of the salary. The rest is funded by the hosting university
    39. see #Participants
    40. https://www.glassdoor.com/Salaries/milan-senior-project-manager-salary-SRCH_IL.0,5_IM1058_KO6,28.htm
    41. https://www.glassdoor.com/Salaries/milan-software-architect-salary-SRCH_IL.0,5_IM1058_KO6,24.htm
    42. https://www.glassdoor.com/Salaries/munich-research-assistant-salary-SRCH_IL.0,6_IM1053_KO7,25.htm
    43. https://www.discogs.com/team
    44. https://metabrainz.org/team
    45. https://getsatisfaction.com/imdb/details/employees
    46. https://www.kew.org/science/our-science/departments/biodiversity-and-spatial-analysis/intelligent-data-analysis
    47. https://www.kew.org/sites/default/files/2019-11/Kew%20Future%20Leaders%20Fellowship.pdf#%5B%7B%22num%22%3A102%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22FitV%22%7D%2C-54%5D

Get involved

Participants

    • Marco Fossati
    Hjfocs is a research scientist with a double background in natural languages and information technology. He holds a PhD in computer science from the University of Trento (Q930528).
    His hybrid profile combines applied research in natural language processing for the Web with software engineering skills, a strong dissemination attitude, and a deep passion for open knowledge and human languages.
    He is currently focusing on Wikidata data quality and has been leading soweego since the very first prototype, as well as the StrepHit project, both funded by the Wikimedia Foundation.
    • Emilio Dorigatti
    Edorigatti is a PhD candidate at the Ludwig Maximilian University of Munich (Q55044), where he is applying machine learning methods to the design of personalized vaccines for HIV and cancer, in collaboration with the Helmholtz Zentrum München (Q878592).
    He holds a BSc in computer science and a double MSc in data science, with a minor in innovation and entrepreneurship. He also has several years of working experience as a software developer, data engineer, and data scientist. He is mostly interested in Bayesian inference, uncertainty quantification, and Bayesian Deep Learning.
    Emilio was a core team member of the StrepHit project.
    • Massimo Frasson
    MaxFrax96 has been a professional software engineer for more than 5 years. He has mainly worked as an iOS developer and game developer at Belka.
    He has always had a strong interest in software architecture and algorithm performance. He is currently attending an MSc in computer science at the University of Milan (Q46210) to deepen his knowledge of data science.
    Massimo has been involved in soweego since the early stages, and has made key contributions to its development.
    • Advisor Volunteering my experience from Mix'n'match, amongst others Magnus Manske (talk) 10:47, 17 February 2020 (UTC)
    • Volunteer I'm part of OpenMLOL, which has a property in Wikidata (https://www.wikidata.org/wiki/Property:P3762) as the identifier of an author in openMLOL. I'm trying to standardize our author ID with the Wikidata ID. .laramar. (talk) 16:18, 20 February 2020 (UTC)

Community notification

    The links below reference notifications to relevant mailing lists and Wiki pages. They are sorted in descending order of specificity.

Endorsements

    Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project by clicking the blue button in the infobox, or edit this section directly. (Other constructive feedback is welcome on the discussion page).

    This section is for endorsements only. Please post your structured feedback on the discussion page. Thanks!

Past endorsements

GitHub stargazers

    See https://github.com/Wikidata/soweego/stargazers.

Current endorsements

    • Very cool project idea. 2001:4CA0:0:F235:2486:8ED5:C3AA:C914 09:30, 27 January 2020 (UTC)
    • Connections between DBs. Frettie (talk) 15:59, 12 February 2020 (UTC)
    • Important update/sync mechanism for large third-party catalogs Magnus Manske (talk) 10:20, 17 February 2020 (UTC)
    • Helps linking item around the world ;) Crazy1880 (talk) 17:16, 19 February 2020 (UTC)
    • As Wikidata matures and becomes more important for knowledge graphs and the products (commercial or not) built upon them, it also becomes more important to keep its quality high. This project contributes to Wikidata's quality by detecting obsolete links between Wikidata and 3rd party databases, discovering new links between Wikidata and 3rd party databases, and helping synchronize facts between Wikidata and 3rd party databases. Nicolastorzec (talk) 00:43, 20 February 2020 (UTC)
    • Matching existing items is our biggest bottleneck in integrating new data sources. Any technological help in this regard is a good thing. The team also looks strong. 99of9 (talk) 01:35, 20 February 2020 (UTC)
    • sounds reasonable and project site is well designed 193.154.94.42 05:42, 20 February 2020 (UTC)
    • Link curation is hard and this will continue to make it easier and more efficient. StudiesWorld (talk) 11:55, 20 February 2020 (UTC)
    • As we import more catalogs, we need more tools to improve the matching process. This will help. - PKM (talk) 21:44, 20 February 2020 (UTC)
    • To complete previous work and provide more automated tools to data catalogs. Sabas88 (talk) 14:28, 21 February 2020 (UTC)
    • This is useful, a benefit for Wikidata as well as the other catalogues, and makes it easy and fun to participate in the creation of knowledge records. Sebastian Wallroth (talk) 09:04, 22 February 2020 (UTC)
    • I've always been a supporter of the project, and continue to be one! Sannita - not just another it.wiki sysop 17:52, 22 February 2020 (UTC)
    • Any technology that helps linking item between DBs and makes easier to participate it's a good idea Tiputini (talk) 18:07, 23 February 2020 (UTC)
    • I particularly like the sync-ing with third party databases. 12.39.25.130 15:22, 24 February 2020 (UTC)
    • Essential for keeping Wikidata updated and usable for rapidly changing databases.-Nizil Shah (talk) 02:14, 26 February 2020 (UTC)
    • This is potentially valuable. To get the most of it, though, the catalogs generated for Mix 'n' Match must be improved (better identifiers and descriptions to allow decisions without *always* having to click through). Some reflection is also due on why relatively little work has been done with MnM so far. Finally, I'd like to see a clear plan for maintenance and sustainability of the project beyond the project lead's current academic context. Will community members be able to keep running the tool? Ijon (talk) 13:07, 26 February 2020 (UTC)
    • The proposal promises a good extension of the first proposal, and the first proposal has really shown it's worth. Identifiers and mappings are an important part of Wikidata, and will become crucial for quality efforts. This is an extremely helpful step to further enable a quality checking approach against external data sources. I agree with Ijon that a focus should be put on making it sustainable so that the project results can be kept active without the project ongoing in the future. denny (talk) 15:02, 26 February 2020 (UTC)
    • As Magnus Manske. Epìdosis 15:28, 26 February 2020 (UTC)