Grants:Project/Hjfocs/soweego/Final


Report under review
This Project Grant report has been submitted by the grantee, and is currently being reviewed by WMF staff. You may add comments, responses, or questions to this report's discussion page.


Welcome to this project's final report! This report shares the outcomes, impact and learnings from the grantee's project.

Part 1: The Project


Summary

Expected outcomes (see goals and progress):
  1. Linker: an artificial intelligence that harnesses machine learning and links people in Wikidata to target catalogs;
  2. 4 large catalogs: Discogs (Q504063), Internet Movie Database (Q37312), MusicBrainz (Q14005), and X (Q918) are linked;
  3. 254,917 confident identifiers to be created or referenced in Wikidata;
  4. 125,866 medium-confident identifiers to be curated via Mix'n'match;
  5. standalone application: packaged with Docker, living in a Wikimedia Cloud VPS machine;
  6. codebase: written in pure Python, 1,700+ commits, 125 closed pull requests.
Additional outcomes (besides the goals):
  1. Validator: sync Wikidata to target catalogs (experimental);
  2. extensible architecture: the community can add new catalogs in a few steps;
  3. pipeline: import, link and sync a catalog to Wikidata in one shot;
  4. works linker: also link Wikidata items about works;
  5. enrichment: generate Wikidata statements about works made by people.

Project Goals


Please copy and paste the project goals from your proposal page. Under each goal, write at least three sentences about how you met that goal over the course of the project. Alternatively, if your goals changed, you may describe the change, list your new goals and explain how you met them, instead.

  • G1: to ensure live maintenance of identifiers for people in Wikidata, via link validation;
    • the team allocated much more effort than expected to this goal;
    • we raised the priority in order to address key feedback from the review committee[1] and the Wikidata development team;[2]
    • the first validator check[3] takes care of the maintenance part;
    • the live side is guaranteed through a Wikimedia Cloud VPS machine[4] that performs regular validation runs;[5]
  • G2: to develop a set of linking techniques that align people in Wikidata to corresponding identifiers in external catalogs;
    • we first designed the baseline, a set of rule-based techniques detailed in a B.Sc. thesis;[6]
    • the baseline techniques served as the main ingredients (read: features) for the machine learning recipes (read: supervised techniques);
    • we employed a set of algorithms that leverage these features, namely Bernoulli Naïve Bayes,[7] Linear Support Vector Machines,[8] Single-layer Perceptrons,[9] and Multi-layer Perceptrons;[10]
  • G3: to ingest links into Wikidata, either through a bot (confident links), or mediated by curation (non-confident links) through the primary sources tool;
    • the soweego bot[11] handles Wikidata uploads of confident links;
    • see its contributions;[12]
    • instead of the primary sources tool, we opted for Mix'n'match[13] for medium-confident links;
    • soweego is now a close Mix'n'match friend: the former caters for large catalogs, the latter for small ones;
  • G4: to achieve exhaustive coverage (ideally 100%) of identifiers over 4 large-scale trusted catalogs;
    • the big fishes phase[14] resulted in the selection of 4 large catalogs and coverage estimates;
    • Discogs (Q504063), Internet Movie Database (Q37312), MusicBrainz (Q14005) are the core ones that drove the development of soweego. X (Q918) follows a separate path;[15]
    • exhaustive coverage is technically achieved, but at the cost of quality. We decided to cut out non-confident links for the sake of data quality, still getting satisfactory results;
  • G5: to deliver a self-sustainable application that can be easily operated by the community after the end of the project.
    • soweego is packaged as a standalone piece of software with Docker;[16]
    • its production deployment can be accessed by Wikimedia users upon request to the Cloud VPS machine administrators;[17]
    • the team focused on the best trade-off between ease of use and flexibility to enable the addition of new catalogs.[18]
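To give a flavor of how the G2 techniques fit together, here is a minimal, hypothetical scikit-learn sketch. The features and toy data below are invented for illustration; soweego's actual implementation lives in its code repository.

```python
# Hypothetical sketch of the supervised linkers named in G2, using scikit-learn.
# Each row holds binary baseline features for a (Wikidata item, catalog entry)
# candidate pair; label 1 means "same entity". Data is invented.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

X = np.array([
    [1, 1, 1],  # name match, birth date match, link match
    [1, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
])
y = np.array([1, 1, 0, 0, 1, 0])

models = {
    'Bernoulli Naive Bayes': BernoulliNB(),
    'Linear SVM': LinearSVC(),
    'Single-layer Perceptron': Perceptron(random_state=0),
    'Multi-layer Perceptron': MLPClassifier(hidden_layer_sizes=(8,),
                                            max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, model.predict([[1, 1, 1], [0, 0, 0]]))
```

In the real system the features come from the rule-based baseline, and each trained model emits a confidence score per candidate pair rather than a hard label.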

Project Impact


Important: The Wikimedia Foundation is no longer collecting Global Metrics for Project Grants. We are currently updating our pages to remove legacy references, but please ignore any that you encounter until we finish.

Targets

  1. In the first column of the table below, please copy and paste the measures you selected to help you evaluate your project's success (see the Project Impact section of your proposal). Please use one row for each measure. If you set a numeric target for the measure, please include the number.
  2. In the second column, describe your project's actual results. If you set a numeric target for the measure, please report numerically in this column. Otherwise, write a brief sentence summarizing your output or outcome for this measure.
  3. In the third column, you have the option to provide further explanation as needed. You may also add additional explanation below this table.
Planned measure of success (include numeric target, if applicable) | Actual result | Explanation
Bot request approval | d:Wikidata:Requests_for_permissions/Bot/Soweego_bot, d:Wikidata:Requests_for_permissions/Bot/Soweego_bot_2, d:Wikidata:Requests_for_permissions/Bot/Soweego_bot_3 | The core proposal task is approved, as well as two extra ones.
Curated statements / new Mix'n'match users ratio[19] | We are not there yet | At the time of writing this report, the statements uploaded to Mix'n'match are only a few days old. Although curation is already happening (and that is a positive sign), we believe it is more reasonable to assess this measure after at least 6 months.
Total Wikidata identifier statements to be created or referenced | 254,917 | This measure corresponds to the core task[20] of the bot: the result is the sum of all confident links from all target catalogs.
Total identifiers to be uploaded to Mix'n'match | 125,866 | Sum of all medium-confident links from all target catalogs.
Total people involved | 72 | Sum of soweego developers, the project advisor, ethics advisors, volunteers, target catalog members, project page watchers, and Wikidata users who provided feedback.


Extra content


We expect to upload additional Wikidata statements corresponding to task 2[21] and task 3[22] of the bot. These will dramatically increase the third target result. As soon as they are generated, we will compute their total amount.

Story


Looking back over your whole project, what did you achieve? Tell us the story of your achievements, your results, your outcomes. Focus on inspiring moments, tough challenges, interesting anecdotes or anything that highlights the outcomes of your project. Imagine that you are sharing with a friend about the achievements that matter most to you in your project.

  • This should not be a list of what you did. You will be asked to provide that later in the Methods and Activities section.
  • Consider your original goals as you write your project's story, but don't let them limit you. Your project may have important outcomes you weren't expecting. Please focus on the impact that you believe matters most.

soweego was born with a mission: link Wikidata people to large third-party catalogs. Although it is still a young artificial intelligence, it has already gone beyond that: it now also links Wikidata works and can connect them with the people who made them. soweego lives in its natural habitat: the Wikimedia Cloud VPS.[23] You can pay it a visit at its place: a virtual machine[4] house. Just ask its parents, Hjfocs and MaxFrax96, for the address.

"Let your robot behave like a human would"

said the wise advisor Magnus Manske during a conversation with Hjfocs (and by the way, the same wise advisor got inspired back by soweego while thinking about the big ones).[24]

soweego made its first linking steps with a set of rules: the baseline. This was not enough, so it learnt to run by trying out several shoes: Naïve Bayes, Support Vector Machines, Single-layer and Multi-layer Perceptrons. The one that fit best was the single-layer perceptron, and that is what it wears now whenever it goes outside to link Wikidata with other catalogs. If it is confident, soweego directly puts links in Wikidata with the help of a sister bot;[11] if it is not sure, it prefers to leave them to its close friend Mix'n'match[13] for curation by the community.

After a careful selection phase, soweego decided to catch 4 big fishes: Discogs (Q504063), Internet Movie Database (Q37312), MusicBrainz (Q14005), and X (Q918) (it actually left the fourth to another friend called SocialLink).[15] This is also because Mix'n'match was already doing a good job with small fishes.

If you want to help soweego catch another big fish, it is not so difficult: you just have to give it the right fishing net![18]

Inspiring moments

  • Understand how to address the most valuable comments:
    • "We are exploring checking Wikidata's data against other databases in order to find mismatches that point to issues in the data. Having more connections via identifiers is a precondition for that." by Lydia Pintscher (WMDE);[25]
    • "Develop a plan for syncing when external database and Wikidata diverge. It may be beyond the scope of the project to address this issue fully, but we would like you to think about how it might be addressed." by LZia (WMF);[1]
    • "A tool that has the perspective to evolve in a way where normal users can add new databases in a reasonable timeframe would on the other hand provide a lot of value." by ChristianKl;[26]
  • Wikidata is alive: the feedback loop when your bot makes mistakes is immediate![27]
  • our advisor Magnus Manske suggested that "it would be great to have a report page of our bot": no need to build that, the report is already there![28]
  • stand on the shoulders of giants: we believe that requests is the most Pythonic Python library, so we took its documentation[29] as a gold standard.

Tough challenges

  • Cope with the (big) waves of discussion[30][31] that surfaced during the project proposal phase;
  • address the key community need: find the best trade-off to allow the addition of new catalogs;[26]
  • grasp the complex pandas[32] Python library and make it work for the soweego use case;
  • efficiency paranoia: tackle memory errors, enable parallel processing, and the like.

Interesting anecdotes

  • Feeling emotions at the 999th commit and the 300th task;
  • simple is better! We unexpectedly found that a single-layer perceptron outperformed a multi-layer one in all the target catalogs;
  • Do you want to miss Java?[33] Just build a complex Python[34] project!
  • See the deep learning meme below.


Methods and activities

 
The soweego team made it to the 999th commit!

Please provide a list of the main methods and activities through which you completed your project.

  • The project has been managed with the same methods as per the midpoint report;[35]
  • we would also like to highlight again the efforts dedicated to the very first tasks: target selection and coverage estimation. See the timeline[36][37][38] and the midpoint report.[39]

soweego development


Each team member followed the typical open-source workflow:

  1. claim a story (multiple tasks) or a single task from the work board;[40]
  2. check out a new branch (with a meaningful name) from the master branch of the code repository;[41]
  3. submit a pull request[42] describing what has been done to resolve the claimed story or task;
  4. elicit code review by other team members;
  5. resolve comments and integrate relevant requested changes;
  6. merge the reviewed pull request into master;
  7. safely delete the merged branch.

See an example of a big pull request.[43]

 
The soweego team got over the 300th task!

Machine learning


The core component of soweego is iteratively built on machine learning experiments as follows:

  1. develop a new desired functionality;
  2. integrate it into the system;
  3. run performance evaluation over all target catalog datasets;
    • the default evaluation method is stratified 5-fold cross-validation[44] with averaged precision, recall,[45] and F1[46] scores;
    • we also implemented further methods, such as nested cross-validation,[47] which can be used as alternatives;
  4. report results;
  5. compare with the prior version of the system;
  6. decide whether to merge the new functionality or not.

See an experiment example.[48]
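The default evaluation step can be reproduced in a few lines of scikit-learn. This is a hypothetical sketch on synthetic data, not the soweego code itself:

```python
# Stratified 5-fold cross-validation with precision, recall, and F1,
# mirroring the default evaluation method described above.
# The classifier choice and the synthetic dataset are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for a target catalog's labeled candidate pairs.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    Perceptron(random_state=0), X, y, cv=cv,
    scoring=('precision', 'recall', 'f1'),
)
for metric in ('test_precision', 'test_recall', 'test_f1'):
    print(metric, scores[metric].mean())
```

Averaging each metric over the 5 folds yields the figures used to compare a new functionality against the prior version of the system.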

Output datasets

Example of soweego catalogs uploaded to Mix'n'match

Each machine learning algorithm yields different output datasets. Here are the steps adopted to understand what data should be sent to Wikidata and to Mix'n'match:

  • evaluate the algorithm over all target catalog input datasets;
  • run the algorithm;
  • plot the confidence score distribution of each output dataset;
  • extract a small sample and manually assess its effectiveness.

See a confidence score distribution example.[49]
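The triage that follows the plotting step can be sketched in pure Python. The scores and the 0.8 / 0.5 thresholds below are invented for illustration; soweego picks its actual cut-offs per catalog by inspecting the score distribution and manually sampling the output.

```python
# Hypothetical triage of linker output by confidence score:
# high-confidence links go to Wikidata via the bot, medium-confidence
# links go to Mix'n'match for community curation, the rest is dropped.
import random

random.seed(0)
# Pretend these are confidence scores for 1,000 candidate links.
scores = [random.random() for _ in range(1000)]

CONFIDENT, MEDIUM = 0.8, 0.5  # assumed thresholds, tuned per catalog
to_wikidata = [s for s in scores if s >= CONFIDENT]
to_mixnmatch = [s for s in scores if MEDIUM <= s < CONFIDENT]
discarded = [s for s in scores if s < MEDIUM]

print(len(to_wikidata), len(to_mixnmatch), len(discarded))

# A quick textual histogram of the score distribution, in lieu of a plot.
for lo in (0.0, 0.2, 0.4, 0.6, 0.8):
    count = sum(lo <= s < lo + 0.2 for s in scores)
    print(f'{lo:.1f}-{lo + 0.2:.1f}', '#' * (count // 10))
```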

Ethical decision making


As suggested by the Project Grants program officer Marti (WMF),[50] we illustrate here a set of ethical considerations.

  1. Background:
    • the main soweego use case targets Wikidata items about people;[51]
    • privacy concerns were raised at a very early stage of the project proposal;[52]
  2. the main issues at stake were:
    • posting personal information is harassment;
    • this proposal will make people's personal life and data too easy to find;
    • social media are not reliable sources;
  3. in order to address these concerns, the review committee recommended having an ethics advisor on board;[53]
    • we strived to fill this role as promptly as possible;[54]
    • Piotrus volunteered to serve as the main ethics advisor;
    • CristianCantoro made himself available as a co-advisor;
  4. in order to cope with the privacy issues summarized above, the development of soweego was driven by the following decisions:
    • to comply with the Wikidata living people policy;[55]
    • to prevent the creation of Wikidata statements with properties that may violate privacy;[56]
    • to prevent the creation of Wikidata statements with properties likely to be challenged;[57]
    • the soweego bot[11] must fully respect the guidelines on bot interaction with items for living people;[58]
    • to exclusively leverage verified accounts when processing the social medium target catalog X (Q918).[59]
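As a purely illustrative sketch, the property blocklist decisions above could be enforced with a check like the following. The property IDs are example Wikidata properties picked for illustration; the authoritative blocklists are the linked Wikidata pages, not these hard-coded sets.

```python
# Hypothetical privacy gate for candidate statements, inspired by the
# decisions above: skip any statement whose property appears on a blocklist.
# These sets are illustrative stand-ins for the actual Wikidata lists.
PROPERTIES_MAY_VIOLATE_PRIVACY = {'P969'}  # e.g. located at street address
PROPERTIES_LIKELY_CHALLENGED = {'P91'}     # e.g. sexual orientation

BLOCKED = PROPERTIES_MAY_VIOLATE_PRIVACY | PROPERTIES_LIKELY_CHALLENGED

def is_uploadable(statement: dict) -> bool:
    """Return True if the statement's property is not on a blocklist."""
    return statement['property'] not in BLOCKED

candidates = [
    {'subject': 'Q42', 'property': 'P1953', 'value': '138751'},  # Discogs ID
    {'subject': 'Q42', 'property': 'P969', 'value': '42 Some Street'},
]
safe = [s for s in candidates if is_uploadable(s)]
print(len(safe))
```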

Project resources


Please provide links to all public, online documents and other artifacts that you created during the course of this project. Even if you have linked to them elsewhere in this report, this section serves as a centralized archive for everything you created during your project. Examples include: meeting notes, participant lists, photos or graphics uploaded to Wikimedia Commons, template messages sent to participants, wiki pages, social media (Facebook groups, Twitter accounts), datasets, surveys, questionnaires, code repositories... If possible, include a brief summary with each link.

Software


Data


The following resources correspond to the soweego production deployment[4] output.

Datasets gathered from Wikidata


Machine learning models


Performance evaluations


Commons uploads


Outreach


Learning


The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.

What worked well


What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.

What didn’t work


What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.

soweego is a piece of software, and all challenges have revolved around technical aspects:

  • the Python programming language[34] is a double-edged sword: it is a perfect fit for simple projects, but it becomes very painful as the complexity grows, and you will miss Java;[33]
    • we believe there is no better IDE (Integrated Development Environment)[60] than the IPython console:[61] this looks like a very bad sign;
    • code refactoring can be quite error-prone even if you use mature IDEs like PyCharm:[62] it may unexpectedly break your project and will make you waste time;
    • there is no standard way for building code documentation: several markup languages and multiple documentation libraries increase the chaos. Tons of ways to do the same thing;
  • when you search for a suitable software library for your use case, you investigate, find one, and start playing with it: it seems a perfect fit. Then, the deeper you dig, the more customization you need, until you feel you should have implemented your use case from scratch;
  • start small, then get bigger, not vice versa: architectural choices like the early adoption of overly complex abstraction libraries for relatively simple scenarios smack of overengineering.

Next steps and opportunities


Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this grant that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.

  • The next short-term step is detailed in the soweego 1.1 rapid grant proposal;[63]
  • the crucial growth opportunity centers on the validator component:[64]
    • it was initially out of this grant scope;
    • we conceived it to address feedback from the Wikidata development team and the Wikimedia research team: "checking Wikidata's data against other databases in order to find mismatches" and "Develop a plan for syncing when external database and Wikidata diverge" respectively;
    • first results look promising;
  • we are going to apply for a renewal of this grant that focuses on the validator;
  • soweego is an open-source project: we should allocate energy to build a community around it;
  • its impact will increase if we add new catalogs: it will be vital to attract contributors for that task.

Part 2: The Grant


Finances


Actual spending


Please copy and paste the completed table from your project finances page. Check that you’ve listed the actual expenditures compared with what was originally planned. If there are differences between the planned and actual use of funds, please use the column provided to explain them.

Expense | Approved amount | Actual funds spent | Difference
Project leader | 53,348 € | 54,396 € | Balancing based on the Dissemination item.
Software architect | 20,690 € | 20,690 € |
Dissemination(1) | 1,200 € | 152 € | WikiCite 2018 was partially covered by a WMF scholarship.
Total | 75,238 € | 75,238 € |


(1) The project leader will also attend WikidataCon 2019, which takes place after the time span of this grant.

Remaining funds


Do you have any unspent funds from the grant?

Please answer yes or no. If yes, list the amount you did not use and explain why.

No.

Documentation


Did you send documentation of all expenses paid with grant funds to grantsadmin wikimedia.org, according to the guidelines here?

Please answer yes or no. If no, include an explanation.

Yes.

Confirmation of project status


Did you comply with the requirements specified by WMF in the grant agreement?

Please answer yes or no.

Yes.

Is your project completed?

Please answer yes or no.

Yes. We still plan to apply for a renewal of this grant.

Grantee reflection


We’d love to hear any thoughts you have on what this project has meant to you, or how the experience of being a grantee has gone overall. Is there something that surprised you, or that you particularly enjoyed, or that you’ll do differently going forward as a result of the Project Grant experience? Please share it here!

This project has confirmed that the Wikimedia grantee experience is invaluable and should be regarded as an extraordinary chance: you support open knowledge, you promote open source, you feel part of a community. Do you need anything else?

The most demanding stage came at the very beginning of this journey: community review. The proposal attracted a large volume of reactions touching very distinct topics. We did all we could to address every single comment and to rephrase the project idea appropriately. This probably resulted in a weaker proposal, though. Therefore, I have a few suggestions for the WMF Program Officers that may improve this side of the Project Grants:

  • to make a strict distinction between a proposal discussion page and the endorsements section. The former is the ideal place for any kind of constructive feedback, while the latter should be exclusively reserved for short statements of support;
  • to detail the community notification section in more depth. In other words, I think it is essential to emphasize that very specific communities will be the best candidates for the most relevant reviews;
  • to highlight that grantees are indeed encouraged to integrate community feedback, but are not required to reply to everyone. This message should hold for both grantees and the community.

References

  1. a b see 3rd main bullet point in Grants_talk:Project/Hjfocs/soweego#Round_2_2017_decision
  2. quoting Lydia_Pintscher_(WMDE)'s endorsement in Grants:Project/Hjfocs/soweego#Endorsements: "We are exploring checking Wikidata's data against other databases in order to find mismatches that point to issues in the data. Having more connections via identifiers is a precondition for that."
  3. https://soweego.readthedocs.io/en/latest/validator.html#soweego.validator.checks.dead_ids
  4. a b c https://tools.wmflabs.org/openstack-browser/project/soweego
  5. https://soweego.readthedocs.io/en/latest/pipeline.html#cron-jobs
  6. https://tools.wmflabs.org/soweego/MaxFrax96_BSc_thesis.pdf
  7. en:Naive_Bayes_classifier#Bernoulli_naive_Bayes
  8. en:Support-vector_machine#Linear_SVM
  9. en:Perceptron
  10. en:Multilayer_perceptron
  11. a b c d:User:Soweego bot
  12. d:Special:Contributions/Soweego_bot
  13. a b https://tools.wmflabs.org/mix-n-match/
  14. Grants:Project/Hjfocs/soweego/Timeline#August_2018:_big_fishes_selection
  15. SocialLink
  16. https://docs.docker.com/install/
  17. Hjfocs and MaxFrax96
  18. a b https://soweego.readthedocs.io/en/latest/new_catalog.html
  19. Note that we merged two proposed measures as per LZia (WMF)'s recommendation, see 2nd bullet point in Grants_talk:Project/Hjfocs/soweego#Round_2_2017_decision
  20. d:Wikidata:Requests_for_permissions/Bot/Soweego_bot
  21. d:Wikidata:Requests_for_permissions/Bot/Soweego_bot_2
  22. d:Wikidata:Requests_for_permissions/Bot/Soweego_bot_3
  23. wikitech:Portal:Cloud_VPS
  24. http://magnusmanske.de/wordpress/?p=471
  25. Grants:Project/Hjfocs/soweego#Endorsements
  26. a b Grants_talk:Project/Hjfocs/soweego#Target_databases_scalability
  27. d:User_talk:Hjfocs
  28. https://xtools.wmflabs.org/ec/wikidata.org/Soweego%20bot
  29. http://docs.python-requests.org
  30. Grants_talk:Project/Hjfocs/soweego
  31. Grants:Project/Hjfocs/soweego#Endorsements
  32. https://pandas.pydata.org/
  33. a b https://go.java/
  34. a b https://www.python.org/
  35. Grants:Project/Hjfocs/soweego/Midpoint#Project_management
  36. Grants:Project/Hjfocs/soweego/Timeline#July_2018:_target_selection_&_small_fishes
  37. Grants:Project/Hjfocs/soweego/Timeline#August_2018:_big_fishes_selection
  38. Grants:Project/Hjfocs/soweego/Timeline#Motivation_#2:_coverage_estimation
  39. Grants:Project/Hjfocs/soweego/Midpoint#Target_selection
  40. https://github.com/Wikidata/soweego/projects/1
  41. https://github.com/Wikidata/soweego/tree/master
  42. https://github.com/Wikidata/soweego/pulls
  43. https://github.com/Wikidata/soweego/pull/339
  44. en:Cross-validation_(statistics)#k-fold_cross-validation
  45. en:Precision_and_recall
  46. en:F1_score
  47. en:Cross-validation_(statistics)#Nested_cross-validation
  48. https://soweego.readthedocs.io/en/latest/experiments.html#string-kernel-feature
  49. https://github.com/Wikidata/soweego/files/3161811/imdb_director_mlp.pdf
  50. See first bullet point in Grants_talk:Project/Hjfocs/soweego/Midpoint
  51. See summary of Grants:Project/Hjfocs/soweego
  52. Grants_talk:Project/Hjfocs/soweego#Privacy_issues_with_personal_information
  53. See last bullet point in the Additional comments from the Committee section of Grants_talk:Project/Hjfocs/soweego#Aggregated_feedback_from_the_committee_for_soweego
  54. Grants:Project/Hjfocs/soweego/Midpoint#Ethic_advisor
  55. d:Wikidata:Living_people
  56. d:Wikidata:WikiProject_Properties/Wikidata_properties_that_may_violate_privacy
  57. d:Wikidata:WikiProject_Properties/Wikidata_properties_likely_to_be_challenged
  58. d:Wikidata:Living_people#Bot_interaction_with_items_for_living_persons
  59. Grants_talk:Project/Hjfocs/soweego/Timeline#Privacy
  60. en:Integrated_development_environment
  61. https://ipython.org/
  62. https://www.jetbrains.com/pycharm/
  63. Grants:Project/Rapid/Hjfocs/soweego_1.1
  64. https://soweego.readthedocs.io/en/latest/validator.html