Grants:Project/Hjfocs/soweego/Timeline
This project is funded by a Project Grant
Timeline for soweego
Timeline | Date |
---|---|
Target databases selection | September 2018 |
Link validator | October 2018 |
Link merger | February 2019 |
Target databases linkers | July 2019 |
Identifiers datasets | July 2019 |
Software package | July 2019 |
Overview
- Project start date: July 9, 2018
- Workboard:
https://github.com/Wikidata/soweego/projects/1
- Codebase:
https://github.com/Wikidata/soweego
Monthly updates
Each update will cover a 1-month time span, starting from the 9th day of the current month. For instance, July 2018 means July 9 to August 8, 2018.
July 2018: target selection & small fishes
TL;DR: Mix'n'match is the tool for small fishes. soweego will not handle them.
The very first task of this project is to select the target databases.[1] We see two directions here: either we focus on a few big and well known targets as per the project proposal, or we can try to find a technique to link a lot of small ones from the long tail, as suggested by ChristianKl[2] (thanks for the precious feedback!).
We used SQID as a starting point to get a list of people databases that are already used in Wikidata, sorted in descending order of usage.[3] This is useful to split the candidates into big and small fishes, namely the head and the (long) tail of the result list respectively. Let's start with the small fishes.
Quoting ChristianKl, it would be ideal to create a configurable tool that enables users to add links to new databases in a reasonable timeframe. Consequently, we carried out the following investigation: we considered as small fishes all the entries in SQID with an external ID datatype, used for class human (Q5), and with fewer than 15 uses in statements. We detail below a set of critical issues with this direction, as well as their possible solutions.
The analysis of a small fish can be broken down into a set of steps. This also helps translate the process into software and make each step flexible enough to deal with the heterogeneity of the long-tail targets. The steps have been implemented in a piece of software by MaxFrax96.[4]
Retrieving the dump
This sounds pretty self-evident: if we aim to link two databases, then we need access to all their entities. Since we focus on people, it is therefore necessary to download the appropriate dump for each small fish we consider.
Problem
In the real world, such a trivial step raises a first critical issue: not all the database websites give us the chance to download the dump.
Solutions
- Cheap, but not scalable: to contact the database administrator and discuss dump releases for Wikidata;
- expensive, but possibly scalable: to autonomously build the dump. If a valid URI exists for each entity, we can re-create the dump. However, this is not trivial to generalize: sometimes it is impossible to retrieve the list of entities, sometimes the URIs are merely HTML pages that require Web scraping (a minimal scraping sketch follows the examples below). See the following examples:
- Welsh Rugby Union men's player ID (P3826) needs scraping for both the list of entities and each entity;
- Berlinische Galerie artist ID (P4580) needs scraping for both the list of entities and each entity;
- FAI ID (P4556) needs scraping for both the list of entities and each entity;
- Debrett's People of Today ID (P2255) does not seem to expose any list of people;
- AGORHA event ID (P2345) does not seem to expose any list of people.
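For illustration only, here is a minimal sketch of what rebuilding a dump by scraping could look like. The URL pattern, CSS selectors and identifiers are made up; requests and BeautifulSoup are just one possible toolset, not necessarily what the project would use.

```python
# Hypothetical sketch: rebuild a tiny "dump" by scraping entity pages.
# The URL pattern, the CSS selectors and the identifiers are made up.
import json

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://example.org/people/{identifier}'  # hypothetical

def scrape_entity(identifier):
    """Fetch one entity page and extract a few fields."""
    response = requests.get(BASE_URL.format(identifier=identifier), timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    return {
        'id': identifier,
        'name': soup.select_one('h1.name').get_text(strip=True),
        'birth': soup.select_one('span.birth-date').get_text(strip=True),
    }

if __name__ == '__main__':
    # The list of identifiers would itself come from scraping an index page,
    # which is exactly what is missing for the examples above.
    dump = [scrape_entity(i) for i in ('abc123', 'def456')]
    print(json.dumps(dump, indent=2))
```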
Handling the format
The long tail is roughly broken down as follows:
- XML;
- JSON;
- RDF;
- HTML pages with styling and whatever a Web page can contain.
Problem
Formats are heterogeneous.
We focus on open data and RDF, as dealing with custom APIs is out of scope for this investigation.
We also hope that the open data trend of recent years will help us.
However, a manual scan of the small fishes yielded poor results: out of 16 randomly picked candidates, only YCBA agent ID (P4169) was in RDF, and it has thousands of uses in statements at the time of writing this report.
Solution
To define a way (by scripting for instance) to translate each input format into a standard project-wide one.
This could be achieved during the next step, namely ontology mapping between a given small fish and Wikidata.
Mapping to Wikidata
Linking Wikidata items to target entities requires a mapping between the two schemas.
Solution
The mapping can be manually defined by the community: a piece of software will then apply it.
To implement this step, we also need the common data format described above.
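Purely as an illustration of how such a community-defined mapping could be applied by software (field names and the mapping itself are hypothetical), it could be as simple as a dictionary from target fields to Wikidata properties:

```python
# Hypothetical sketch: apply a community-defined field-to-property mapping
# to a record already converted into the common project-wide format.
FIELD_TO_WIKIDATA = {   # hypothetical mapping, to be curated by the community
    'name': 'P2561',    # name
    'birth': 'P569',    # date of birth
    'death': 'P570',    # date of death
}

def to_wikidata_claims(record):
    """Translate a target record into (property, value) pairs."""
    return [
        (FIELD_TO_WIKIDATA[field], value)
        for field, value in record.items()
        if field in FIELD_TO_WIKIDATA and value is not None
    ]

print(to_wikidata_claims({'name': 'Jane Doe', 'birth': '1901', 'death': None}))
```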
Side note: available entity metadata
Small fishes may contain entity metadata which are likely to be useful for automatic matching.
The entity linking process may dramatically improve if the system is able to mine extra property mappings.
This is obvious when metadata are in different languages; more generally, we cannot be sure that two different databases hold the same set of properties, or that they share any at all.
Conclusion
It is out of scope for the project to perform entity linking over the whole set of small fishes. On the other hand, it may make sense to build a system that lets the community plug in new small fishes with relative ease. Nevertheless, this would require reshaping the original proposal, which comes with its own risks:
- it is probably not a safe investment of resources;
- results would probably not come in the short term, as a lot of work is needed to create a flexible system for everybody's needs;
- it is likely that the team has not yet faced the extra problems that would eventually arise in this phase.
Mix'n'match
Most importantly, a system to plug in new small fishes already exists: Mix'n'match[5] is specifically designed for the task.[6] Instead of reinventing the wheel, we will join efforts with our advisor Magnus Manske in his work on big fishes.[7]
August 2018: big fishes selection
TL;DR:
- the soweego team selected 4 candidate targets:
  - BIBSYS (Q4584301). Coverage = 21% (discarded, see #September 2018);
  - Discogs (Q504063). Coverage = 33%;
  - Internet Movie Database (Q37312). Coverage = 42%;
  - MusicBrainz (Q14005). Coverage = 35%;
  - Twitter (Q918). Coverage = 31%;
- the soweego team will join efforts with Magnus Manske's work on large catalogs.
Motivation #1: target investigation
The following table displays the result of our investigation on candidate big fishes. We computed the Wikidata item counts as follows (a sketch of such a counting query follows the list).
- Wikidata item count queries on specific classes, e.g., humans, authors, musicians;
- Wikidata link count queries, using each candidate's identifier property.
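For illustration, a counting query of this kind could look as follows. The class and identifier property are just examples; these are not the exact queries behind the table below.

```python
# Illustrative sketch: count humans without an IMDb ID (P345) via the
# Wikidata Query Service. Not the exact queries used for the table below.
import requests

WDQS = 'https://query.wikidata.org/sparql'
QUERY = """
SELECT (COUNT(?person) AS ?total) WHERE {
  ?person wdt:P31 wd:Q5 .                    # instance of human
  FILTER NOT EXISTS { ?person wdt:P345 [] }  # no IMDb ID
}
"""

# Note: counting over all humans is heavy and may hit the public endpoint timeout.
response = requests.get(WDQS, params={'query': QUERY, 'format': 'json'}, timeout=60)
response.raise_for_status()
count = response.json()['results']['bindings'][0]['total']['value']
print('Humans without an IMDb ID:', count)
```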
Resource | # entries | Reference to # entries | Dump download URL | Online access (e.g., SPARQL) | # Wikidata items with link / without link | Available metadata | Links to other sources | In mix'n'match | TL;DR: | Candidate? |
---|---|---|---|---|---|---|---|---|---|---|
[12] | 7,037,189 | [13] | [14] | SRU: [15], OAI-PHM: [16] | humans: 571,357 / 3,931,587; authors: 168,188 / 326,989 | id, context, preferredName, surname, forename, describedBy, type, dateOfBirth, dateOfDeath, sameAs | GND, BNF, LoC, VIAF, ISNI, English Wikipedia, Wikidata | Yes (large catalogs) | Already processed by Mix'n'match large catalogs, see [17] | Oppose |
[18] | > 8 million | names authority file [19] | [20] | Not found | humans: 581,397 / 3,921,547; authors: 204,813 / 290,364 | URI, Instance Of, Scheme Membership(s), Collection Membership(s), Fuller Name, Variants, Additional Information, Birth Date, Has Affiliation, Descriptor, Birth Place, Associated Locale, Birth Place, Gender, Associated Language, Field of Activity, Occupation, Related Terms, Exact Matching Concepts from Other Schemes, Sources | Not found | [21] | Already well represented in Wikidata, low impact expected | Oppose |
[22] | 8,738,217 | [23] | [24] | Not found | actors, directors, producers: 197,626 / 104,392 | name, birth year, death year, profession, movies | Not found | No | Metadata allow running easy yet effective matching strategies; the license can be used for linking, see [25]; quite well represented in Wikidata (2/3 of the relevant subset) | Support |
[26] | 2,181,744 | authors, found in home page | [27] | SPARQL: [28] | humans: 356,126 / 4,146,818; authors: 148,758 / 346,419 | country, language, variants of name, pages in data.bnf.fr, sources and references | LoC, GND, VIAF, IdRef, Geonames, Agrovoc, Thesaurus W | Yes (large catalogs) | Seems well shaped; already processed by Mix'n'match large catalogs, see [29] | Oppose |
[30] | about 1.5 M | dataset described at [31] | [32] | SPARQL | humans: 94,009 / 4,408,935; authors: 40,656 / 454,521 | depend on the links found for the ID | VIAF, GND | [33] | Underrepresented in Wikidata, small subset (47k entries) in Mix'n'match, of which 67% is unmatched, high impact expected | Strong support |
[34] | About 500 k | Search for a,b,c,d... in the search window | Not found | SOLR | humans: 378,261 / 4,124,683; authors: 153,024 / 342,153 | name, language, nationality, notes | [35], [36] | No | Discarded: no dump available | Oppose |
[37] | 417 k | [38] | [39] | API | humans: 303,235 / 4,199,709; authors: 4,966 / 490,211 | PersonId, EngName, ChName, IndexYear, Gender, YearBirth, DynastyBirth, EraBirth, EraYearBirth, YearDeath, DynastyDeath, EraDeath, EraYearDeath, YearsLived, Dynasty, JunWang, Notes, PersonSources, PersonAliases, PersonAddresses, PersonEntryInfo, PersonPostings, PersonSocialStatus, PersonKinshipInfo, PersonSocialAssociation, PersonTexts | list of external sources: [40] | [41] | The database is in a proprietary format (Microsoft Access) | Weak support |
[42] | About 500 k | found as per [43] | [44] | API, SRU | humans: 274,574 / 4,228,370; authors: 92,662 / 402,515 | name, birth, death, identifier, SKOS preferred label, SKOS alternative labels | VIAF, GeoNames, Wikipedia, LoC Subject Headings | [45] | Lots of links to 3 external databases, but few metadata; seems to be the same as BNF. | Oppose |
[46] | 1,393,817 | [47] | [48] | API | humans: 98,115 / 4,404,829; musicians & bands: 114,798 / 190,599 | URI, Type, Gender, Born, Born in, Died, Died in, Area, IPI code, ISNI code, Rating, Wikipedia bio, Name, Discography, Annotation, Releases, Recordings, Works, Events, Relationships, Aliases, Tags, Detail | [49], [50], [51], [52], [53], [54], [55], [56], VIAF, Wikidata, English Wikipedia, YouTube/Vevo, [57], [58], resource official web page, [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75] | No | High quality data, plenty of external links, totally open source, regular dump releases | Strong support |
[76] | About 1 M | People found as per [77], with restriction to people | Not found | API | humans: 128,536 / 4,374,408; authors: 42,991 / 452,186 | birth, death, nationality, language, archival resources, related resources, related external links, ark ID, SNAC ID | [78], [79] | [80] | No dump available, 99.9% already matched in Mix'n'match | Oppose |
[81] | 840,883 | [82] | [83] | API | humans: 167,663 / 4,335,281; authors: 410 / 494,767 | name, country, keywords, other IDs, education, employment, works | [84], [85] | Yes (large catalogs) | Already processed by Mix'n'match large catalogs, see [86] | Oppose |
[87] | 6,921,024 | [88] | [89] | API | humans: 140,883 / 4,362,061; authors: 58,823 / 436,354 | name, birth year, death year | Not found | [90] | Only names, birth and death dates; no dedicated pages for people entries; source code: [91] | Neutral |
[92] | Not found | Not found | Not found | Not found | Not found | Not found | Not found | [93] | Seems closed. The dataset providers claimed they would publish a new site, which has not happened so far [94] | Oppose |
[95] | 336 M active | [96] | Not found | API | humans: 85,527 / 4,417,417 | verified account, user name, screen name, description, location, followers, friends, tweets, listed, favorites, statuses | Plenty | No | No official dump available, but the team has collected the dataset of verified accounts. Links stem from home page URLs, should be filtered according to a white list. Underrepresented in Wikidata, high impact expected | Strong support |
[97] | Not found | Not found | Not found | Not found | Not found | Not found | Not found | No | Seems it does not contain people data, only books by author | Oppose |
[98] | 20,255 | [99] | [100] | SPARQL | humans: 81,455 / 4,421,489; authors: 34,496 / 460,681 | Subject line, Homepage ID, Synonym, Broader term, Hyponym, Related words, NOTE Classification symbol (NDLC), Classification symbol (NDC 9), Classification symbol (NDC 10), Reference (LCSH), Reference (BSH 4), Source (BSH 4), Source edit history, Created date, last updated | VIAF, LoC | No | Mismatch between the actual dataset and the links in Wikidata; it extensively refers to VIAF and LoC (see [101], entry 12 of the table and [102]) | Oppose |
[103] | Not found | Not found | Not found | API | humans: 97,599 / 4,405,345; authors: 28,404 / 466,773 | Not found | Not found | Not found | No dump available | Oppose |
[104] | 5,736,280 | [105] | [106] | API + Python client | humans: 66,185 / 4,436,759; musicians & bands: 78,522 / 226,875 | artist name, real name, short bio, aliases, releases, band membership | Plenty, top-5 frequent: [107], [108], [109], [110], [111] | [112] | CC0 license, 92% not matched in Mix'n'match, high impact expected | Strong support |
[113] | 1,269,331 | [114] | [115], [116] | SPARQL | humans: 106,839 / 4,396,105; authors: 50,251 / 444,926 | name, bio | [117], [118], [119], [120], [121], [122], [123], [124], [125] | [126] | Outdated dump (2013), low quality data, 75% not matched in Mix'n'match | Oppose |
Motivation #2: coverage estimation
We computed coverage estimations over Strong support and Support candidates to assess their impact on Wikidata, as suggested by Nemo bis[8] (thanks for the valuable comment!). In a nutshell, coverage means how many existing Wikidata items could be linked.
For each candidate, the estimation procedure is as follows.
- pick a representative 1% sample of Wikidata items with no identifier for the candidate. Representative means, e.g., musicians for MusicBrainz (Q14005): it would not make sense to link generic people to a catalog of musical artists;
- implement a matching strategy:
- perfect = perfect matches on names, sitelinks, external links;
- similar = matches on names and external links based on tokenization and stopword removal (see the sketch after this list);
- SocialLink, as per our approach;[9]
- compute the percentage of matched items with respect to the sample.
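A rough sketch of the "similar" name comparison and of the coverage computation follows. The sample data and the stopword list are made up; this is not the actual implementation.

```python
# Rough sketch of the "similar" strategy: tokenize names, drop stopwords,
# then compare token sets. Sample data and stopwords are made up.
import re

STOPWORDS = {'the', 'band', 'jr', 'sr'}  # hypothetical

def tokens(name):
    return {t for t in re.split(r'\W+', name.lower()) if t and t not in STOPWORDS}

def similar(name_a, name_b):
    return tokens(name_a) == tokens(name_b)

wikidata_sample = {'Q1': 'The Beatles', 'Q2': 'John Doe Jr'}   # hypothetical
target_names = ['Beatles, The', 'Jane Doe']                    # hypothetical

matches = sum(
    any(similar(wd_name, t_name) for t_name in target_names)
    for wd_name in wikidata_sample.values()
)
print('Coverage: {:.0%}'.format(matches / len(wikidata_sample)))
```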
The table below shows the result. It is worth observing that similar coverage percentages correspond to different matching strategies: this may suggest that each candidate requires different algorithms to achieve the same goal. Our hypothesis is that higher data quality entails simpler solutions: for instance, MusicBrainz (Q14005) seems like a well structured catalog, so the simplest strategy is sufficient.
Target | Sample | Matching strategy | # matches | % coverage |
---|---|---|---|---|
BIBSYS (Q4584301) | 4,249 authors and teachers | Perfect | 899 | 21% |
Discogs (Q504063) | 1,253 musicians | Perfect & similar | 414 | 33%(1) |
Internet Movie Database (Q37312) | 1,022 actors, directors and producers | Perfect | 432 | 42% |
MusicBrainz (Q14005) | 1,100 musicians | Perfect | 388 | 35% |
Twitter (Q918) | 15,565 living humans | SocialLink | 4,867(2) | 31% |
(1) using perfect matching strategy only: 4.6%
(2) out of which 609 are confident matches
September 2018
We manually assessed small subsets of the matches obtained after the coverage estimations. Given the scores and the evaluation, we decided to discard BIBSYS (Q4584301). The main reasons, beyond the mere score, follow.
- The dump is not synchronized with the online data;
- identifiers in the dump may not exist online;
- cross-catalog links in the dump may not be the same as online;
- the dump suffers from inconsistency:
  - the same identifier may have multiple links, thus flawing the link-based matching strategy;
  - links from different catalogs may have different quality, e.g., one may be correct, the other not;
- online data can also be inconsistent. A match may be correct, but the online identifier may have a wrong cross-catalog link.
We report below a first round of evaluation that estimates the performance of already implemented matchers over the target catalogs. Note that MusicBrainz (Q14005) was evaluated more extensively thanks to MaxFrax96's thesis work.
Target | Matching strategy | # samples | Precision |
---|---|---|---|
BIBSYS (Q4584301) | Perfect links | 10 | 50% |
Discogs (Q504063) | Similar links | 10 | 90% |
Discogs (Q504063) | Similar names | 32 | 97% |
Internet Movie Database (Q37312) | Perfect names | 10 | 70% |
MusicBrainz (Q14005) | Perfect names | 38 | 84% |
MusicBrainz (Q14005) | Perfect names + dates | 32 | 100% |
MusicBrainz (Q14005) | Similar names | 24 | 71% |
MusicBrainz (Q14005) | Perfect links | 71 | 100% |
MusicBrainz (Q14005) | Similar links | 102 | 99% |
Twitter (Q918) | SocialLink | 67 | 91% |
Technical
- Baseline matchers finalized;
- d:User:Soweego_bot account created;
- request for the bot flag approved: d:Wikidata:Requests_for_permissions/Bot/soweego_bot;
- added first set of identifier statements from baseline matchers;
- started work on Internet Movie Database (Q37312), computed coverage estimation;
- Validator component:
- delete invalid statements not complying with criterion 1;
- first working version of validation criterion 2 (links).
Dissemination
- MaxFrax96 successfully defended his bachelor thesis[10] on soweego. Congratulations!
- Lc_fd joined the project as a volunteer developer. Welcome!
October 2018
During this month, the team devoted itself to software development, with tasks broken down as follows.
Application package
This is how the software is expected to ship. Tasks:
- packaged soweego in 2 Docker containers:
- test launches a local database instance to enable work on a target catalog dump extraction and import;
- production feeds the shared Toolforge large catalogs database;
- let a running container see live changes in the code.
Validator
This component is responsible for monitoring the divergence between Wikidata and a given target catalog. It implements bullet point 3 of the project review committee recommendations[11] and performs validation of Wikidata content based on 3 main criteria:[12]
- existence of target identifiers;
- agreement with the target on third-party links;
- agreement with the target on "stable" metadata.
Tasks:
- existence-based validation (criterion 1):
- first run over MusicBrainz (Q14005);
- gather target catalog data through database queries;
- full implementation of the link-based validation (criterion 2); a simplified sketch follows below.
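The sketch below reduces criterion 2 to comparing plain sets of third-party URLs on both sides; the real component works against the Toolforge database and handles URL normalization, so this is only a toy illustration.

```python
# Simplified sketch of criterion 2: compare third-party links on both sides.
# Real validation runs database queries and normalizes URLs first.
def validate_links(wikidata_links, target_links):
    """Return links in agreement, links to add, and links to double-check."""
    wikidata_links, target_links = set(wikidata_links), set(target_links)
    return {
        'shared': wikidata_links & target_links,               # agreement
        'missing_in_wikidata': target_links - wikidata_links,  # candidates to add
        'wikidata_only': wikidata_links - target_links,        # to double-check
    }

result = validate_links(
    ['https://viaf.org/viaf/123', 'https://example.org/x'],    # hypothetical
    ['https://viaf.org/viaf/123', 'https://www.discogs.com/artist/1'],
)
print(result)
```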
Importer
This component extracts a given target catalog data dump, cleans it, and imports it into Magnus_Manske's database on Toolforge. It follows ChristianKl's suggestion[13] and is designed as a general-purpose facility for the developer community to import new target catalogs. Tasks:
- worked on MusicBrainz (Q14005):
- split dump into musicians and bands;
- extraction and import of musicians and bands;
- extraction and import of links.
Ingestor
This component is a Wikidata bot that uploads the linker and validator output. Tasks:
- deprecate identifier statements not passing validation;
- handle statements to be added: if the statement already exists in Wikidata, just add a reference node.
Utilities
- Complete URL validation: pre-processing, syntax parsing and resolution;
- URL tokenization (see the sketch after this list);
- text tokenization;
- match URLs known by Wikidata as external identifiers and convert them accordingly.
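As an illustration of what URL tokenization means here, a toy sketch (not the project's code; the noise list is made up):

```python
# Toy sketch of URL tokenization: split a URL into meaningful tokens,
# dropping the scheme and common noise like "www".
import re
from urllib.parse import urlsplit

NOISE = {'www', 'index', 'html', 'htm', 'php', 'com', 'org', 'net', ''}

def tokenize_url(url):
    parts = urlsplit(url.lower())
    raw = re.split(r'[/._\-?=&]+', parts.netloc + parts.path)
    return [token for token in raw if token not in NOISE]

print(tokenize_url('https://www.example.org/artists/John_Doe/index.html'))
# ['example', 'artists', 'john', 'doe']
```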
November 2018
The team focused on the importer and linker modules.
Importer
- Worked on Discogs (Q504063):
- split dump into musicians and bands;
- extraction and import of musicians and bands;
- extraction, validation and import of links;
- extraction and import of textual data.
- major effort on building full-text indices on the Toolforge database:
- the Python library we use does not natively support them;
- investigated alternative solutions, i.e., https://github.com/mengzhuo/sqlalchemy-fulltext-search;
- managed to implement them on top of the initial library (see the sketch after this list);
- Refinements for MusicBrainz (Q14005).
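For illustration, a minimal SQLAlchemy Core sketch of declaring and querying a MySQL FULLTEXT index; the table, columns and connection string are hypothetical, and this is not soweego's actual schema (it also requires a running MySQL instance).

```python
# Minimal sketch: declare and query a MySQL FULLTEXT index with SQLAlchemy Core.
# Table, column names and connection string are hypothetical.
from sqlalchemy import (Column, Index, Integer, MetaData, String, Table,
                        create_engine, text)

metadata = MetaData()
musician = Table(
    'musician', metadata,
    Column('internal_id', Integer, primary_key=True),
    Column('catalog_id', String(50), index=True),
    Column('name', String(255)),
    # FULLTEXT is MySQL-specific, hence the mysql_prefix
    Index('ftix_musician_name', 'name', mysql_prefix='FULLTEXT'),
)

def search_names(connection, terms):
    """Run a MATCH ... AGAINST query over the name column."""
    clause = text(
        'MATCH(name) AGAINST(:terms IN NATURAL LANGUAGE MODE)'
    ).bindparams(terms=terms)
    return connection.execute(musician.select().where(clause)).fetchall()

if __name__ == '__main__':
    engine = create_engine('mysql+pymysql://user:password@localhost/catalogs')  # hypothetical
    metadata.create_all(engine)
    with engine.connect() as connection:
        print(search_names(connection, 'john doe'))
```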
Linker
- Added baseline strategies to the importer workflow. They now consume input from the Toolforge database;
- adapted and improved the similar name strategy, which leverages full-text indices on the Toolforge database;
- preparations for the baseline datasets.
Validator
- First working version of the metadata-based validation (criterion 3).
Ingestor
- Add referenced third-party external identifier statements from link-based validation;
- add referenced described at URL (P973) statements from link-based validation;
- add referenced statements from metadata-based validation.
Dissemination
- The project leader attended WikiCite 2018, see WikiCite 2018#Attendees;
- joined the SPARQL jam session during day 3, discussed with Fuzheado alternative methods to slice Wikidata subsets via SPARQL;[14]
- new connections: Miriam_(WMF) from WMF research team, Jkatz_(WMF) from WMF readers department, Giovanni from Turing Institute UK, Susannaanas from Open Knowledge Finland, Michelleif from Stanford University;
- synchronized with Tpt, T_Arrow, Adam_Shorland_(WMDE), Smalyshev_(WMF), Maxlath, LZia_(WMF), Dario_(WMF), Denny, Sannita.
December 2018
We focused on 2 key activities:
- research on probabilistic record linkage;[15]
- packaging of the complete soweego pipeline.
Probabilistic record linkage
Deterministic approaches are rule-based linking strategies and represent a reasonable baseline. On the other hand, probabilistic ones leverage machine learning algorithms and are known to perform effectively.[16] Therefore, we expect our baseline to serve as the set of features for probabilistic methods.
- First exploration and hands on the recordlinkage library:[17]
- understood how the library applies the general workflow: cleaning, indexing, comparison, classification, evaluation (see the sketch after this list);
- published a report that details the required implementation steps;[18]
- started the first probabilistic linkage experiment, i.e., using the naïve Bayes algorithm;[19]
- recordlinkage extensively employs DataFrame objects from the well-known pandas Python library:[20] investigation and hands-on work with it;
- started work on the training set building:
- gathered the Wikidata training set live from the Web API;
- gathered the target training set from the Toolforge database;
- converted both to suitable pandas dataframes;
- custom implementation of the cleaning step;
- indexing implemented as blocking on the target identifiers;
- started work on feature extraction.
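A condensed sketch of the recordlinkage workflow on toy DataFrames follows. Column names and data are made up, and full indexing is used instead of the blocking on target identifiers mentioned above; the real pipeline builds its DataFrames from Wikidata and the Toolforge database.

```python
# Condensed sketch of the recordlinkage workflow on toy data:
# indexing, comparison (features), classification (naive Bayes).
import pandas as pd
import recordlinkage

wikidata = pd.DataFrame(
    {'name': ['john doe', 'jane roe'], 'birth_year': ['1901', '1920']},
    index=['Q1', 'Q2'],
)
target = pd.DataFrame(
    {'name': ['john doe', 'j. smith'], 'birth_year': ['1901', '1955']},
    index=['T1', 'T2'],
)

# Indexing: compare every Wikidata row against every target row
# (the real pipeline blocks on the target identifiers instead).
indexer = recordlinkage.Index()
indexer.full()
candidate_pairs = indexer.index(wikidata, target)

# Comparison: one feature per column.
compare = recordlinkage.Compare()
compare.string('name', 'name', method='levenshtein', label='name')
compare.exact('birth_year', 'birth_year', label='birth_year')
features = compare.compute(candidate_pairs, wikidata, target)

# Classification: train a naive Bayes classifier on known matches.
true_links = pd.MultiIndex.from_tuples([('Q1', 'T1')])
classifier = recordlinkage.NaiveBayesClassifier()
classifier.fit(features, true_links)
print(classifier.predict(features))
```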
Pipeline packaging
- Finalized work on full-text indices on the Toolforge database;
- adapted perfect name and similar link baseline strategies to work against the Toolforge database;
- built a utility to retrieve mappings between Toolforge database tables and SQLAlchemy entities;
- completed the similar link baseline strategy;
- linking based on edit distances now works with SQLAlchemy full-text indices;
- baseline linking can now run from the command line interface;
- various Docker improvements:
- set up volumes in the production instance;
- allow custom configuration in the test instance;
- set up the execution of all the steps for the final pipeline.
January 2019
Happy new year! We are pleased to announce a new member of the team: Tupini07. Welcome on board! Tupini07 will work on the linker for Internet Movie Database (Q37312). The development activities follow.
IMDb
- Clustered professions related to music;
- reached out to IMDb licensing department;
- understood how the miscellaneous profession is used in the catalog dump.
Probabilistic linker
- Investigated Naïve Bayes classification in the recordlinkage Python library;
- worked on feature extraction;
- grasped performance evaluation in the recordlinkage Python library;
- completed the Naïve Bayes linker experiment;
- engineered the vector space model feature;
- gathered Wikidata aliases for dataset building;
- discussed how to handle feature extraction in different languages.
Baseline linker
- Assessed similar URLs link results;
- piped linker output to the ingestor;
- read input data from the Wikidata live stream;
- worked on birth and death dates linking strategy.
Importer
- Imported MusicBrainz (Q14005) URLs;
- added support for multiple dump files;
- extracted ISNI codes from MusicBrainz (Q14005) artist attributes.
Package
- Installed less in Docker;
- set up the final pipeline as a Docker container;
- hit a segmentation fault when training in a Docker container on a specific machine;
- improved Docker configuration.
February 2019
The team fully concentrated on software development, with a special focus on the probabilistic linkers.
Probabilistic linker
- Handled missing data from Wikidata and the target;
- parsed dates at preprocessing time;
- started work on blocking via queries against the target full-text index;
- understood custom blocking logic in the recordlinkage Python library;
- resolved object QIDs;
- dropped data columns containing missing values only;
- included negative samples in the training set;
- enabled dump of evaluation predictions to CSV;
- started work on scaling up the whole probabilistic pipeline (see the sketch after this list):
- implemented chunk processing techniques;
- parallelized feature extraction;
- avoided redundant input/output operations on files when gathering target datasets.
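A simplified sketch of the chunking idea follows: the feature function is a toy, the chunk size is made up, and the real code extracts recordlinkage features rather than a single equality check.

```python
# Simplified sketch: process candidate pairs in chunks and extract features
# in parallel, to keep memory usage bounded. The feature function is a toy.
import multiprocessing

import pandas as pd

CHUNK_SIZE = 1000  # made-up value

def extract_features(chunk):
    """Toy feature: 1.0 when names are identical, 0.0 otherwise."""
    return (chunk['wikidata_name'] == chunk['target_name']).astype(float)

def chunks(df, size):
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size]

if __name__ == '__main__':
    pairs = pd.DataFrame({
        'wikidata_name': ['john doe'] * 2500,
        'target_name': ['john doe', 'jane roe'] * 1250,
    })
    with multiprocessing.Pool() as pool:
        results = pool.map(extract_features, chunks(pairs, CHUNK_SIZE))
    features = pd.concat(results)
    print(len(features), 'feature values computed')
```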
Importer
- The IMDb importer is ready;
- handled connection issues with the target database engine;
- made the expensive URL resolution functionality optional;
- fixed a problem causing the MusicBrainz import to fail;
- improved logging of the MusicBrainz dump extractor;
- added batch insert functionality;
- added import progress tracker;
- extra logging for the Discogs dump extractor;
- enabled bulk insertion via the SQLAlchemy Python library.
Ingestor
- Uploaded a 1% sample of the Twitter linker output to Wikidata;
- filtered the dataset on confident links;
- resolved Twitter UIDs against usernames.
March 2019
This was a crucial month. In a nutshell:
- the probabilistic linker workflow is in place;
- we successfully ran it over complete imports of the target catalogs;
- we uploaded samples of the linkers that performed best to Wikidata;
- we produced the following evaluation reports for the Naïve Bayes (NB) and Support Vector Machines (SVM) algorithms.
- Discogs NB: https://github.com/Wikidata/soweego/issues/171#issuecomment-476293971;
- Discogs SVM: https://github.com/Wikidata/soweego/issues/171#issuecomment-477978766;
- IMDb NB: https://github.com/Wikidata/soweego/issues/204#issuecomment-477964956;
- IMDb SVM: https://github.com/Wikidata/soweego/issues/204#issuecomment-478038363;
- MusicBrainz NB: https://github.com/Wikidata/soweego/issues/203#issuecomment-477558282;
- MusicBrainz SVM: https://github.com/Wikidata/soweego/issues/203#issuecomment-478015882.
Linker
- Removed empty tokens from full-text index query;
- prevented positive samples indices from being empty;
- implemented a feature for dates;
- implemented a feature based on similar names;
- implemented a feature for occupations:
- gathered specific statements from Wikidata;
- ensured that occupation statements are only gathered when needed;
- enabled comparison in the whole occupation classes tree;
- handled missing target data;
- avoided computing features when Wikidata or target DataFrame columns are not available;
- built blocking via full-text query over the whole Wikidata dataset;
- built full index of positive samples;
- simplified probabilistic workflow;
- checked whether relevant Wikidata or target DataFrame columns exist before adding the corresponding feature;
- k-fold evaluation;
- ensured to pick a model file based on the supported classifiers;
- filtered duplicate predictions;
- first working version of the SVM linker;
- avoided stringifying list values.
Importer
- Fixed an issue that caused the failure of the IMDb import pipeline;
- parallelized URL validation;
- prevented the import of unnecessary occupations in IMDb;
- occupations that are already expressed on the import table name do not get imported;
- decompressed Discogs and MusicBrainz dumps are now deleted after a successful import;
- avoided populating tables when a MusicBrainz entity type is unknown.
Miscellanea
- Optimized full-text index queries;
- the perfect name match baseline now runs bulk queries;
- set up the Wikidata API login with the project bot;
- progress bars do not disappear anymore.
April 2019
After introducing 2 machine learning algorithms, i.e., naïve Bayes and support vector machines, this month we brought neural networks into focus.
The major outcome is a complete run of all linkers over the whole datasets; the evaluation results are available at https://github.com/Wikidata/soweego/wiki/Linkers-evaluation.
Linker
- Decided not to handle QIDs with multiple positive samples;
- added feature that captures full names;
- added an optional post-classification rule that filters out matches with different names;
- injected an SVM linker based on libsvm instead of liblinear; this allows non-linear kernels at the cost of higher training time;
- first implementation of a single-layer perceptron;
- added a set of Keras callbacks;
- ensured a training/validation set split when training neural networks;
- incorporated early stopping at training time of neural networks (see the sketch after this list);
- implemented a rule to enforce links of MusicBrainz entities that already have a Wikidata URL;
- built a stopword list for bands;
- enabled cache of complete training & classification sets for faster prototyping;
- constructed facilities for hyperparameter tuning through grid search, available at evaluation time and optionally at training time;
- experimented with multi-layer perceptron architectures.
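A compact sketch of a single-layer perceptron with early stopping in Keras follows. The feature count, data and hyperparameters are made up and are not soweego's actual settings.

```python
# Compact sketch: single-layer perceptron with early stopping in Keras.
# Feature count, training data and hyperparameters are made up.
import numpy as np
from keras.callbacks import EarlyStopping
from keras.layers import Dense
from keras.models import Sequential

N_FEATURES = 5  # e.g., name, link, date, occupation features

model = Sequential([Dense(1, input_dim=N_FEATURES, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Toy training data: feature vectors and binary match/non-match labels.
features = np.random.rand(1000, N_FEATURES)
labels = np.random.randint(0, 2, size=1000)

early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
model.fit(
    features,
    labels,
    validation_split=0.2,   # training/validation split
    epochs=100,
    batch_size=32,
    callbacks=[early_stopping],
    verbose=0,
)
```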
Importer
- Fixed a misleading log message when importing MusicBrainz relationships.
May 2019
This month was pretty packed. The team's work revolved around 3 main activities:
- development of new linkers for musical and audiovisual works;
- refactoring & documentation of the code base;
- facility to upload medium-confidence results to the Mix'n'match tool.
New linkers
- imported Discogs masters;
- imported IMDb titles;
- imported MusicBrainz releases;
- implemented the musical work linker;
- implemented the audiovisual work linker.
Refactor & document
- Code style:
- refactored & documented the pipeline;
- refactored & documented the importer module;
- refactored & documented the ingestor module.
Mix'n'match client
- Interacted with the project advisor, who is also the maintainer of the tool;
- added ORM entities for the Mix'n'match catalog and entry DB tables.
Linker
- Added string kernels as a feature for names;
- completed the multi-layer perceptron;
- handled too many concurrent SPARQL queries being sent when gathering the tree of occupation QIDs;
- fixed parallelized full-text blocking, which made IMDb crash.
Importer
- Avoided populating DB tables when a Discogs entity type is unknown.
Ingestor
- Populated statements connecting works with people.
Continuous integration
- Set up Travis;[24]
- added build badge to the README;
- let Travis push formatted code.
June 2019
The final month was totally devoted to 5 major tasks:
- deployment of soweego in production;
- upload of results;
- documentation;
- code style;
- refactoring.
Production deployment
- Set up the production-ready Wikimedia Cloud VPS machine;[25]
- dry-ran and monitored production-ready pipelines for each target catalog;
- structured the output folder tree;
- decided confidence score thresholds;
- the pipeline script now backs up the output folder of the previous run;
- avoided interactive login to the Wikidata Web API;
- enabled the extraction of Wikidata URLs available in target catalogs;
- set up scripts for cron jobs.
Results upload
- Confident links (i.e., with score above 0.8) are being uploaded to Wikidata via d:User:Soweego bot;
- medium-confidence links (i.e., with score between 0.5 and 0.8) are being uploaded to Mix'n'match for curation by the community (see the sketch below).
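The routing logic boils down to a simple partition by confidence score; the toy sketch below uses the thresholds described above, while the predictions themselves are hypothetical.

```python
# Toy sketch of the upload routing: confident links go to Wikidata,
# medium-confidence links go to Mix'n'match for human curation.
WIKIDATA_THRESHOLD = 0.8
MIX_N_MATCH_THRESHOLD = 0.5

predictions = [   # hypothetical linker output: (QID, target ID, confidence)
    ('Q42', 'nm0000001', 0.93),
    ('Q84', 'nm0000002', 0.61),
    ('Q1', 'nm0000003', 0.12),
]

to_wikidata = [p for p in predictions if p[2] >= WIKIDATA_THRESHOLD]
to_mix_n_match = [
    p for p in predictions
    if MIX_N_MATCH_THRESHOLD <= p[2] < WIKIDATA_THRESHOLD
]

print('Upload to Wikidata:', to_wikidata)
print("Send to Mix'n'match:", to_mix_n_match)
```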
Documentation
- Added Sphinx-compliant[26] documentation strings to all public functions and classes;
- complied with PEP 257[27] and PEP 287;[28]
- converted and uplifted relevant pages of the GitHub Wiki into Python documentation;
- customized the look of the documentation theme;
- deployed the documentation to Read the Docs;[29]
- completed the validator module;
- completed the Wikidata module;
- completed the linker module: this activity required extra efforts, since it is soweego's core;
- main command line documentation;
- full README.
Code style
- Complied with PEP 8[30] and Wikimedia[31] conventions;
- added type hints[32] to public function signatures.
Refactoring
- fixed pylint errors and relevant warnings;
- reduced code complexity;
- applied relevant pylint refactoring suggestions.
- ↑ Grants:Project/Hjfocs/soweego#Work_package
- ↑ Grants_talk:Project/Hjfocs/soweego#Target_databases_scalability
- ↑ Select datatype set to ExternalId, Used for class set to human (Q5)
- ↑ https://github.com/MaxFrax/Evaluation
- ↑ https://tools.wmflabs.org/mix-n-match/
- ↑ http://magnusmanske.de/wordpress/?p=471
- ↑ http://magnusmanske.de/wordpress/?p=478
- ↑ Grants_talk:Project/Hjfocs/soweego#Coverage_statistics
- ↑ https://iswc2017.semanticweb.org/wp-content/uploads/papers/MainProceedings/441.pdf
- ↑ https://tools.wmflabs.org/soweego/MaxFrax96_BSc_thesis.pdf
- ↑ Grants_talk:Project/Hjfocs/soweego#Round_2_2017_decision
- ↑ https://github.com/Wikidata/soweego/issues/19#issuecomment-413622924
- ↑ Grants_talk:Project/Hjfocs/soweego#Target_databases_scalability
- ↑ https://etherpad.wikimedia.org/p/WikiCite18Day3sparql
- ↑ en:Record_linkage
- ↑ http://axon.cs.byu.edu/~randy/pubs/wilson.ijcnn2011.beyondprl.pdf
- ↑ https://recordlinkage.readthedocs.io
- ↑ https://github.com/Wikidata/soweego/wiki/Notes-on-the-recordlinkage-Python-library
- ↑ https://github.com/Wikidata/soweego/issues/146
- ↑ https://pandas.pydata.org/
- ↑ https://black.readthedocs.io/
- ↑ https://pypi.org/project/autoflake/
- ↑ https://pylint.readthedocs.io/
- ↑ https://travis-ci.com/
- ↑ https://tools.wmflabs.org/openstack-browser/project/soweego
- ↑ https://www.sphinx-doc.org/
- ↑ https://www.python.org/dev/peps/pep-0257/
- ↑ https://www.python.org/dev/peps/pep-0287/
- ↑ https://soweego.readthedocs.io/
- ↑ https://www.python.org/dev/peps/pep-0008/
- ↑ https://www.mediawiki.org/wiki/Manual:Coding_conventions/Python
- ↑ https://docs.python.org/3/library/typing.html