Welcome to this project's final report! This report shares the outcomes, impact and learnings from the grantee's project.
Part 1: The Project
Please copy and paste the project goals from your proposal page. Under each goal, write at least three sentences about how you met that goal over the course of the project. Alternatively, if your goals changed, you may describe the change, list your new goals and explain how you met them, instead.
- G1: to ensure live maintenance of identifiers for people in Wikidata, via link validation;
- the team allocated much more effort than expected to this goal;
- we raised the priority in order to address key feedback from the review committee and the Wikidata development team;
- the first validator check takes care of the maintenance part;
- the live side is guaranteed through a Wikimedia Cloud VPS machine that performs regular validation runs;
- G2: to develop a set of linking techniques that align people in Wikidata to corresponding identifiers in external catalogs;
- we first designed the baseline, a set of rule-based techniques detailed in a B.Sc. thesis;
- the baseline served as the main ingredients (read: features) for the machine learning recipes (read: supervised techniques);
- we employed a set of algorithms that leverage these features, namely Bernoulli Naïve Bayes, Linear Support Vector Machines, Single-layer Perceptrons, and Multi-layer Perceptrons;
- G3: to ingest links into Wikidata, either through a bot (confident links), or mediated by curation (non-confident links) through the primary sources tool;
- G4: to achieve exhaustive coverage (ideally 100%) of identifiers over 4 large-scale trusted catalogs;
- the big fishes phase resulted in the selection of 4 large catalogs and coverage estimates;
- Discogs (Q504063), Internet Movie Database (Q37312), MusicBrainz (Q14005) are the core ones that drove the development of soweego. Twitter (Q918) follows a separate path;
- exhaustive coverage is technically achieved, but at the cost of quality. We decided to cut out non-confident links for the sake of data quality, still getting satisfactory results;
- G5: to deliver a self-sustainable application that can be easily operated by the community after the end of the project.
- soweego is packaged as a standalone piece of software with Docker;
- its production deployment can be accessed by Wikimedia users upon request to the Cloud VPS machine administrators;
- the team focused on the best trade-off between ease of use and flexibility to enable the addition of new catalogs.
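As an illustration of goal G2, here is a minimal sketch of how rule-based comparisons can feed a supervised classifier. It is not the soweego implementation: it trains a single sigmoid unit (a single-layer perceptron) in plain Python, and the three features, the training pairs, and all function names are hypothetical.

```python
import math
import random


def predict_confidence(weights, bias, features):
    """Return a score in (0, 1): how likely the pair is the same entity."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_single_layer_perceptron(samples, labels, epochs=200, learning_rate=0.5, seed=0):
    """Fit one sigmoid unit with stochastic gradient descent on the log loss."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            # Gradient of the log loss with respect to the pre-activation.
            error = predict_confidence(weights, bias, samples[i]) - labels[i]
            for j, value in enumerate(samples[i]):
                weights[j] -= learning_rate * error * value
            bias -= learning_rate * error
    return weights, bias


# Hypothetical features per (Wikidata item, catalog entry) pair:
# [name similarity, birth date match, shared external link].
TRAINING_SAMPLES = [
    [1.0, 1.0, 1.0],
    [0.9, 1.0, 0.0],
    [0.8, 0.0, 1.0],
    [0.4, 0.0, 0.0],
    [0.2, 1.0, 0.0],
    [0.1, 0.0, 0.0],
]
TRAINING_LABELS = [1, 1, 1, 0, 0, 0]  # 1 = same entity, 0 = different entities

weights, bias = train_single_layer_perceptron(TRAINING_SAMPLES, TRAINING_LABELS)
print(predict_confidence(weights, bias, [0.95, 1.0, 0.0]))
```

The output of `predict_confidence` plays the role of the confidence score that decides whether a link is confident or not.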
Important: The Wikimedia Foundation is no longer collecting Global Metrics for Project Grants. We are currently updating our pages to remove legacy references, but please ignore any that you encounter until we finish.
- In the first column of the table below, please copy and paste the measures you selected to help you evaluate your project's success (see the Project Impact section of your proposal). Please use one row for each measure. If you set a numeric target for the measure, please include the number.
- In the second column, describe your project's actual results. If you set a numeric target for the measure, please report numerically in this column. Otherwise, write a brief sentence summarizing your output or outcome for this measure.
- In the third column, you have the option to provide further explanation as needed. You may also add additional explanation below this table.
|Planned measure of success (include numeric target, if applicable)||Actual result||Explanation|
|Bot request approval||d:Wikidata:Requests_for_permissions/Bot/Soweego_bot, d:Wikidata:Requests_for_permissions/Bot/Soweego_bot_2, d:Wikidata:Requests_for_permissions/Bot/Soweego_bot_3||The core proposal task is approved, as well as two extra ones.|
|Curated statements / new Mix'n'match users ratio||We are not there yet||At the time of writing this report, the statements uploaded to Mix'n'match are only a few days old. Although curation is already happening (and that is a positive sign), we believe it is more reasonable to assess this measure after at least 6 months.|
|Total Wikidata identifier statements to be created or referenced||254,917||This measure corresponds to the core task of the bot: the result is the sum of all confident links from all target catalogs.|
|Total identifiers to be uploaded to Mix'n'match||125,866||Sum of all medium-confident links from all target catalogs.|
|Total people involved||72||Sum of |
We expect to upload additional Wikidata statements corresponding to task 2 and task 3 of the bot. These will dramatically increase the third target result. As soon as they are generated, we will compute their total amount.
Looking back over your whole project, what did you achieve? Tell us the story of your achievements, your results, your outcomes. Focus on inspiring moments, tough challenges, interesting anecdotes or anything that highlights the outcomes of your project. Imagine that you are sharing with a friend about the achievements that matter most to you in your project.
- This should not be a list of what you did. You will be asked to provide that later in the Methods and Activities section.
- Consider your original goals as you write your project's story, but don't let them limit you. Your project may have important outcomes you weren't expecting. Please focus on the impact that you believe matters most.
soweego was born with a mission: link Wikidata people to large third-party catalogs.
Although it is still a young artificial intelligence, it has already gone beyond that: now it also links Wikidata works and can connect them with people who made them.
soweego lives in its natural habitat: the Wikimedia Cloud VPS.
You can pay it a visit at its place: a virtual machine house. Just ask its parents Hjfocs and MaxFrax96 for the address.
"Let your robot behave like a human would"
soweego made its first linking steps with a set of rules: the baseline.
This was not enough, so it learnt to run by trying out several shoes: Naïve Bayes, Support Vector Machines, Single-layer and Multi-layer Perceptrons.
The one that fit best was the single-layer perceptron, and that is what it wears now whenever it goes outside to link Wikidata with other catalogs.
If it is confident, soweego directly puts links in Wikidata with the help of a sister bot; if it is not sure, it prefers to leave them to its close friend Mix'n'match for curation by the community.
After a careful selection phase, soweego decided to catch 4 big fishes: Discogs (Q504063), Internet Movie Database (Q37312), MusicBrainz (Q14005), and Twitter (Q918) (it actually left the fourth to another friend called SocialLink).
This is also because Mix'n'match was already doing a good job with small fishes.
If you want to help soweego catch another big fish, it is not so difficult: you just have to give it the right fishing net!
- Understand how to address the most valuable comments:
- "We are exploring checking Wikidata's data against other databases in order to find mismatches that point to issues in the data. Having more connections via identifiers is a precondition for that." by Lydia Pintscher (WMDE);
- "Develop a plan for syncing when external database and Wikidata diverge. It may be beyond the scope of the project to address this issue fully, but we would like you to think about how it might be addressed." by LZia (WMF);
- "A tool that has the perspective to evolve in a way where normal users can add new databases in a reasonable timeframe would on the other hand provide a lot of value." by ChristianKl;
- Wikidata is alive: the feedback loop when your bot makes mistakes is immediate!
- our advisor Magnus Manske suggested that "it would be great to have a report page of our bot": no need to build that, the report is already there!
- stand on the shoulders of giants: we believe that requests is the most Pythonic Python library, so we took its documentation as gold.
- Cope with the (big) waves of discussion that surfaced during the project proposal phase;
- address the key community need: find the best trade-off to allow the addition of new catalogs;
- grasp the complex pandas Python library and make it work for the
- efficiency paranoia: tackle memory errors, enable parallel processing, and the like.
- Feeling emotions at the 999th commit and the 300th task;
- simple is better! We unexpectedly found that a single-layer perceptron outperformed a multi-layer one in all the target catalogs;
- Do you want to miss Java? Just build a complex Python project!
- See the deep learning meme below.
Methods and activities
Please provide a list of the main methods and activities through which you completed your project.
- The project has been managed with the same methods as per the midpoint report;
- we would also like to highlight again the efforts dedicated to the very first tasks: target selection and coverage estimation. See the timeline and the midpoint report.
Each team member followed the typical open-source workflow:
- claim a story (multiple tasks) or a single task from the work board;
- check out a new branch (with a meaningful name) from the master branch of the code repository;
- submit a pull request describing what has been done to resolve the claimed story or task;
- elicit code review by other team members;
- resolve comments and integrate relevant requested changes;
- merge the reviewed pull request into
- safely delete the merged branch.
See an example of a big pull request.
The core component of soweego is iteratively built on machine learning experiments as follows:
- develop a new desired functionality;
- integrate it into the system;
- run performance evaluation over all target catalog datasets;
- report results;
- compare with the prior version of the system;
- decide whether to merge the new functionality or not.
See an experiment example.
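The comparison and merge decision in the steps above can be sketched as follows. This is only an illustration: the report does not state the acceptance rule, so the "no F1 regression on any dataset" policy, the class and function names, and the counts are all hypothetical placeholders, not project results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    """Confusion counts of one system version over one catalog dataset."""
    true_positives: int
    false_positives: int
    false_negatives: int

    @property
    def precision(self):
        predicted = self.true_positives + self.false_positives
        return self.true_positives / predicted if predicted else 0.0

    @property
    def recall(self):
        actual = self.true_positives + self.false_negatives
        return self.true_positives / actual if actual else 0.0

    @property
    def f1(self):
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if p + r else 0.0


def should_merge(prior, candidate, tolerance=0.0):
    """Hypothetical policy: accept only if F1 does not drop on any dataset."""
    return all(candidate[name].f1 + tolerance >= prior[name].f1 for name in prior)


# Placeholder counts for illustration only.
prior_version = {
    "discogs_band": Evaluation(80, 20, 20),
    "imdb_actor": Evaluation(90, 10, 30),
}
new_functionality = {
    "discogs_band": Evaluation(85, 15, 20),
    "imdb_actor": Evaluation(90, 10, 25),
}

for name in prior_version:
    print(name, round(prior_version[name].f1, 3), round(new_functionality[name].f1, 3))
print("merge:", should_merge(prior_version, new_functionality))
```

A stricter or looser rule (for example, comparing the mean F1 across catalogs) would only change `should_merge`.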
Each machine learning algorithm yields different output datasets. Here are the steps adopted to understand what data should be sent to Wikidata and to Mix'n'match:
- evaluate the algorithm over all target catalog input datasets;
- run the algorithm;
- plot the confidence score distribution of each output dataset;
- extract a small sample and manually assess its effectiveness.
See a confidence score distribution example.
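The steps above can be sketched in a few lines: bin the confidence scores to inspect their distribution, then route each link according to its score. The two thresholds, the function names, and the sample links are assumptions made for illustration; the report does not state the actual cut-off values.

```python
from collections import Counter


def score_histogram(scores, bins=10):
    """Count scores in [0, 1] per equal-width bin; 1.0 falls into the last bin."""
    counts = Counter(min(int(score * bins), bins - 1) for score in scores)
    return [counts.get(i, 0) for i in range(bins)]


def route_links(links, upload_threshold=0.8, curation_threshold=0.5):
    """Split (QID, catalog ID, score) triples into three destinations.

    Both thresholds are hypothetical defaults, to be tuned after looking
    at the distribution and at a manually assessed sample.
    """
    routed = {"wikidata": [], "mixnmatch": [], "discarded": []}
    for link in links:
        score = link[2]
        if score >= upload_threshold:
            routed["wikidata"].append(link)      # confident: bot upload
        elif score >= curation_threshold:
            routed["mixnmatch"].append(link)     # medium-confident: curation
        else:
            routed["discarded"].append(link)     # non-confident: dropped
    return routed


# Made-up links with placeholder identifiers.
sample_links = [
    ("Q1", "a1", 0.95),
    ("Q2", "a2", 0.85),
    ("Q3", "a3", 0.62),
    ("Q4", "a4", 0.41),
    ("Q5", "a5", 0.05),
    ("Q6", "a6", 1.0),
]

for bin_index, count in enumerate(score_histogram([link[2] for link in sample_links])):
    print(f"{bin_index / 10:.1f}-{(bin_index + 1) / 10:.1f} {'#' * count}")
routed = route_links(sample_links)
print({destination: len(items) for destination, items in routed.items()})
```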
Ethical decision making
- the main issues at stake were:
- posting personal information is harassment;
- this proposal will make people's personal life and data too easy to find;
- social media are not reliable sources;
- in order to address these concerns, the review committee recommended having an ethics advisor on board;
- in order to cope with the privacy issues summarized above, the development of soweego was driven by the following decisions:
- to comply with the Wikidata living people policy;
- to prevent the creation of Wikidata statements with properties that may violate privacy;
- to prevent the creation of Wikidata statements with properties likely to be challenged;
- soweego bot must fully respect the guidelines on bot interaction with items for living people;
- to exclusively leverage verified accounts when processing the social medium target catalog Twitter (Q918).
Please provide links to all public, online documents and other artifacts that you created during the course of this project. Even if you have linked to them elsewhere in this report, this section serves as a centralized archive for everything you created during your project. Examples include: meeting notes, participant lists, photos or graphics uploaded to Wikimedia Commons, template messages sent to participants, wiki pages, social media (Facebook groups, Twitter accounts), datasets, surveys, questionnaires, code repositories... If possible, include a brief summary with each link.
- soweego code repository: https://github.com/Wikidata/soweego
- documentation: https://soweego.readthedocs.io/
- Phabricator project (code repository redirect): https://phabricator.wikimedia.org/project/profile/3476/
- Cloud VPS project (production deployment): https://tools.wmflabs.org/openstack-browser/project/soweego
- Toolforge tool: https://toolsadmin.wikimedia.org/tools/id/soweego
- SocialLink code repository: https://github.com/Remper/sociallink/tree/devel
The following resources correspond to the soweego production deployment output.
Datasets gathered from Wikidata
- Discogs (Q504063)
- band training set: https://tools.wmflabs.org/soweego/wikidata/wikidata_discogs_band_training_set.jsonl.gz
- band classification set: https://tools.wmflabs.org/soweego/wikidata/wikidata_discogs_band_classification_set.jsonl.gz
- musician training set: https://tools.wmflabs.org/soweego/wikidata/wikidata_discogs_musician_training_set.jsonl.gz
- musician classification set: https://tools.wmflabs.org/soweego/wikidata/wikidata_discogs_musician_classification_set.jsonl.gz
- musical work training set: https://tools.wmflabs.org/soweego/wikidata/wikidata_discogs_musical_work_training_set.jsonl.gz
- musical work classification set: https://tools.wmflabs.org/soweego/wikidata/wikidata_discogs_musical_work_classification_set.jsonl.gz
- Internet Movie Database (Q37312)
- actor training set: https://tools.wmflabs.org/soweego/wikidata/wikidata_imdb_actor_training_set.jsonl.gz
- actor classification set: https://tools.wmflabs.org/soweego/wikidata/wikidata_imdb_actor_classification_set.jsonl.gz
- director training set: https://tools.wmflabs.org/soweego/wikidata/wikidata_imdb_director_training_set.jsonl.gz
- director classification set: https://tools.wmflabs.org/soweego/wikidata/wikidata_imdb_director_classification_set.jsonl.gz
- musician training set: https://tools.wmflabs.org/soweego/wikidata/wikidata_imdb_musician_training_set.jsonl.gz
- musician classification set: https://tools.wmflabs.org/soweego/wikidata/wikidata_imdb_musician_classification_set.jsonl.gz
- producer training set: https://tools.wmflabs.org/soweego/wikidata/wikidata_imdb_producer_training_set.jsonl.gz
- producer classification set: https://tools.wmflabs.org/soweego/wikidata/wikidata_imdb_producer_classification_set.jsonl.gz
- writer training set: https://tools.wmflabs.org/soweego/wikidata/wikidata_imdb_writer_training_set.jsonl.gz
- writer classification set: https://tools.wmflabs.org/soweego/wikidata/wikidata_imdb_writer_classification_set.jsonl.gz
- audiovisual work training set: https://tools.wmflabs.org/soweego/wikidata/wikidata_imdb_audiovisual_work_training_set.jsonl.gz
- audiovisual work classification set: https://tools.wmflabs.org/soweego/wikidata/wikidata_imdb_audiovisual_work_classification_set.jsonl.gz
- MusicBrainz (Q14005)
- band training set: https://tools.wmflabs.org/soweego/wikidata/wikidata_musicbrainz_band_training_set.jsonl.gz
- band classification set: https://tools.wmflabs.org/soweego/wikidata/wikidata_musicbrainz_band_classification_set.jsonl.gz
- musician training set: https://tools.wmflabs.org/soweego/wikidata/wikidata_musicbrainz_musician_training_set.jsonl.gz
- musician classification set: https://tools.wmflabs.org/soweego/wikidata/wikidata_musicbrainz_musician_classification_set.jsonl.gz
- musical work training set: https://tools.wmflabs.org/soweego/wikidata/wikidata_musicbrainz_musical_work_training_set.jsonl.gz
- musical work classification set: https://tools.wmflabs.org/soweego/wikidata/wikidata_musicbrainz_musical_work_classification_set.jsonl.gz
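The datasets above are gzipped JSON Lines files (.jsonl.gz). A minimal sketch for reading one of them after downloading it, using only the Python standard library; the function name and the local file name are hypothetical, and the structure of each record is not described in this report, so inspect the first records before relying on any field.

```python
import gzip
import json


def iter_jsonl_gz(path):
    """Yield one decoded JSON value per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)


def peek(path, limit=3):
    """Return the first few records, to inspect their structure."""
    records = []
    for record in iter_jsonl_gz(path):
        records.append(record)
        if len(records) == limit:
            break
    return records
```

For example, `peek("wikidata_discogs_band_training_set.jsonl.gz")` would return up to three records from a previously downloaded copy of the first dataset.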
Machine learning models
- Discogs (Q504063)
- band: https://tools.wmflabs.org/soweego/models/discogs_band_single_layer_perceptron_model.pkl
- musician: https://tools.wmflabs.org/soweego/models/discogs_musician_single_layer_perceptron_model.pkl
- musical work: https://tools.wmflabs.org/soweego/models/discogs_musical_work_single_layer_perceptron_model.pkl
- Internet Movie Database (Q37312)
- actor: https://tools.wmflabs.org/soweego/models/imdb_actor_single_layer_perceptron_model.pkl
- director: https://tools.wmflabs.org/soweego/models/imdb_director_single_layer_perceptron_model.pkl
- musician: https://tools.wmflabs.org/soweego/models/imdb_musician_single_layer_perceptron_model.pkl
- producer: https://tools.wmflabs.org/soweego/models/imdb_producer_single_layer_perceptron_model.pkl
- writer: https://tools.wmflabs.org/soweego/models/imdb_writer_single_layer_perceptron_model.pkl
- audiovisual work: https://tools.wmflabs.org/soweego/models/imdb_audiovisual_work_single_layer_perceptron_model.pkl
- MusicBrainz (Q14005)
- band: https://tools.wmflabs.org/soweego/models/musicbrainz_band_single_layer_perceptron_model.pkl
- musician: https://tools.wmflabs.org/soweego/models/musicbrainz_musician_single_layer_perceptron_model.pkl
- musical work: https://tools.wmflabs.org/soweego/models/musicbrainz_musical_work_single_layer_perceptron_model.pkl
- Discogs (Q504063)
- band: https://tools.wmflabs.org/soweego/links/discogs_band_single_layer_perceptron_links.csv.gz
- musician: https://tools.wmflabs.org/soweego/links/discogs_musician_single_layer_perceptron_links.csv.gz
- musical work: https://tools.wmflabs.org/soweego/links/discogs_musical_work_single_layer_perceptron_links.csv.gz
- Internet Movie Database (Q37312)
- actor: https://tools.wmflabs.org/soweego/links/imdb_actor_single_layer_perceptron_links.csv.gz
- director: https://tools.wmflabs.org/soweego/links/imdb_director_single_layer_perceptron_links.csv.gz
- musician: https://tools.wmflabs.org/soweego/links/imdb_musician_single_layer_perceptron_links.csv.gz
- producer: https://tools.wmflabs.org/soweego/links/imdb_producer_single_layer_perceptron_links.csv.gz
- writer: https://tools.wmflabs.org/soweego/links/imdb_writer_single_layer_perceptron_links.csv.gz
- audiovisual work: https://tools.wmflabs.org/soweego/links/imdb_audiovisual_work_single_layer_perceptron_links.csv.gz
- MusicBrainz (Q14005)
- band: https://tools.wmflabs.org/soweego/links/musicbrainz_band_single_layer_perceptron_links.csv.gz
- musician: https://tools.wmflabs.org/soweego/links/musicbrainz_musician_single_layer_perceptron_links.csv.gz
- musical work: https://tools.wmflabs.org/soweego/links/musicbrainz_musical_work_single_layer_perceptron_links.csv.gz
- Discogs (Q504063)
- Internet Movie Database (Q37312)
- actor: https://tools.wmflabs.org/soweego/evaluations/imdb_actor_slp_performance.txt
- director: https://tools.wmflabs.org/soweego/evaluations/imdb_director_slp_performance.txt
- musician: https://tools.wmflabs.org/soweego/evaluations/imdb_musician_slp_performance.txt
- producer: https://tools.wmflabs.org/soweego/evaluations/imdb_producer_slp_performance.txt
- writer: https://tools.wmflabs.org/soweego/evaluations/imdb_writer_slp_performance.txt
- audiovisual work: https://tools.wmflabs.org/soweego/evaluations/imdb_audiovisual_work_slp_performance.txt
- MusicBrainz (Q14005)
- B.Sc. thesis slides by MaxFrax96: https://commons.wikimedia.org/wiki/File:Soweego_baseline.pdf
- soweego 999th commit: https://commons.wikimedia.org/wiki/File:Soweego_999th_commit.png
- soweego 300th task: https://commons.wikimedia.org/wiki/File:Soweego_300th_task.png
- example of catalogs uploaded to Mix'n'match: https://commons.wikimedia.org/wiki/File:Mix%27n%27match_soweego_catalogs.png
- the deep learning meme: https://commons.wikimedia.org/wiki/
- B.Sc. thesis by MaxFrax96: https://tools.wmflabs.org/soweego/MaxFrax96_BSc_thesis.pdf
- WikiCite 2018 group photo: https://commons.wikimedia.org/wiki/File:Group_photo_-_WikiCite_2018_(02).jpg
- Hjfocs's lightning interview at WikiCite 2018: http://www.openscienceradio.org/2019/01/06/osr134-wikicite-2018-enjoy-the-community-en/?t=14:30
- lunch at Internet Archive (Q461) after WikiCite 2018: https://commons.wikimedia.org/wiki/File:2018-1130-internet-archive-visit-by-wikimedians-04.jpg
- PhD thesis by Remper: http://remper.ru/thesis/v1.4/All.pdf
The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.
What worked well
What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.
What didn’t work
What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.
soweego is a piece of software, and all challenges have revolved around technical aspects:
- the Python programming language is a double-edged sword: it is a perfect fit for simple projects, but it becomes very painful as the complexity grows, and you will miss Java;
- we believe there is no better IDE (Integrated Development Environment) than the IPython console: this looks like a very bad sign;
- code refactoring can be quite error-prone even if you use mature IDEs like PyCharm: it may unexpectedly break your project and will make you waste time;
- there is no standard way for building code documentation: several markup languages and multiple documentation libraries increase the chaos. Tons of ways to do the same thing;
- you are searching for a suitable software library for your use case; after some investigation, you find it, and you start playing with it: it seems a perfect fit. Then, the deeper you dig, the more customization you need. You now feel you should have implemented your use case from scratch;
- start small, then get bigger, not vice versa: architectural choices like the early adoption of overly complex abstraction libraries for relatively simple scenarios have the taste of overengineering.
Next steps and opportunities
Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this grant that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.
- The next short-term step is detailed in the soweego 1.1 rapid grant proposal;
- the crucial growth opportunity centers on the validator component:
- it was initially out of this grant scope;
- we conceived it to address feedback from the Wikidata development team and the Wikimedia research team: "checking Wikidata's data against other databases in order to find mismatches" and "Develop a plan for syncing when external database and Wikidata diverge" respectively;
- first results look promising;
- we are going to apply for a renewal of this grant that focuses on the validator;
- soweego is an open-source project: we should allocate energy to build a community around it;
- its impact will increase if we add new catalogs: it will be vital to attract contributors for that task.
Part 2: The Grant
Please copy and paste the completed table from your project finances page. Check that you’ve listed the actual expenditures compared with what was originally planned. If there are differences between the planned and actual use of funds, please use the column provided to explain them.
|Expense||Approved amount||Actual funds spent||Difference|
|Project leader||53,348 €||54,396 €||Balancing based on the Dissemination item.|
|Software architect||20,690 €||20,690 €|
|Dissemination(1)||1,200 €||152 €||WikiCite 2018 was partially covered by a WMF scholarship.|
|Total||75,238 €||75,238 €|
(1) The project leader will also attend WikidataCon 2019, which takes place after the time span of this grant.
Do you have any unspent funds from the grant?
Please answer yes or no. If yes, list the amount you did not use and explain why.
Please answer yes or no. If no, include an explanation.
Confirmation of project status
Did you comply with the requirements specified by WMF in the grant agreement?
Please answer yes or no.
Is your project completed?
Please answer yes or no.
Yes. We still plan to apply for a renewal of this grant.
We’d love to hear any thoughts you have on what this project has meant to you, or how the experience of being a grantee has gone overall. Is there something that surprised you, or that you particularly enjoyed, or that you’ll do differently going forward as a result of the Project Grant experience? Please share it here!
This project has confirmed that the Wikimedia grantee experience is invaluable and should be regarded as an extraordinary chance: you support open knowledge, you promote open source, you feel part of a community. Do you need anything else?
The most demanding stage happened at the very beginning of this journey: community review. The proposal attracted a large volume of reactions, which touched on very distinct topics. We did all we could to address every single comment and to rephrase the project idea accordingly. This probably resulted in a weaker proposal, though. Therefore, I have a few suggestions for the WMF Program Officers that may improve this side of the Project Grants:
- to make a strict distinction between a proposal discussion page and the endorsements section. The former is the ideal place for any kind of constructive feedback, while the latter should be exclusively reserved for short statements of support;
- to detail the community notification section in more depth. In other words, I think it is essential to emphasize that very specific communities will be the best candidates for the most relevant reviews;
- to highlight that grantees are indeed encouraged to integrate community feedback, but are not required to reply to everyone. This message should hold for both grantees and the community.
- see 3rd main bullet point in Grants_talk:Project/Hjfocs/soweego#Round_2_2017_decision
- quoting Lydia_Pintscher_(WMDE)'s endorsement in Grants:Project/Hjfocs/soweego#Endorsements: "We are exploring checking Wikidata's data against other databases in order to find mismatches that point to issues in the data. Having more connections via identifiers is a precondition for that."
- d:User:Soweego bot
- Hjfocs and MaxFrax96
- Note that we merged two proposed measures as per LZia (WMF)'s recommendation, see 2nd bullet point in Grants_talk:Project/Hjfocs/soweego#Round_2_2017_decision
- See first bullet point in Grants_talk:Project/Hjfocs/soweego/Midpoint
- see summary of Grants:Project/Hjfocs/soweego
- See last bullet point in the Additional comments from the Committee section of Grants_talk:Project/Hjfocs/soweego#Aggregated_feedback_from_the_committee_for_soweego