GlobalFactSync (GFS) News and CrossWiki Feedback Squad!

What's the GFS Feedback Squad?

  • The Squad's main purpose is to receive cross-wiki pings with GFS news. Being listed as a volunteer carries no obligation.
  • If you have feedback or questions, please leave them on the project talk page or send them to gfs@infai.org.
  • We might also occasionally include short lists of tasks that the Squad could take on. Members are encouraged to pick the tasks they feel suited to and mark them as done once completed.

How to join

  • Subscribe and get pinged on GFS updates by clicking the join button on our project proposal page
  • or watch-list this page
  • Unsubscribe by editing the proposal page and removing your account from the {{Probox |volunteer= parameter

News and reports


Pinging all previous contributors


Dear @BrillLyle:, @Crazy1880:, @Danscott:, @DonaldTrung:, @Ehrlich91:, @GerardM:, @He7d3r:, @Jimkont:, @Juliaholze:, @Jura1:, @KirilSimeonovski:, @KrzysztofWecel:, @Lewoniewski:, @M1ci:, @Metrónomo:, @Mgns:, @MikePeel:, @Moebeus:, @Multichill:, @Pine:, @Pintoch:, @Rachmat04:, @Sabas88:, @SebastianHellmann:, @S.karampatakis:, @Slowking4:, @TomT0m:, @VojtěchDostál:, @XblackX:, @YULdigitalpreservation:, @Sj:, @Nicolastorzec:, @ChristorangeCA:, @Lewoniewski:, @Vladimir Alexiev:, @Lambdamusic:, @Sannita:, @Digitaleffie:, @Lydia Pintscher (WMDE):, @YamanBeyza:, @White gecko:, @Tina Schmeissner:, @KrzysztofWecel:, @Mrvnhfr:, @JohannesFre:,

Thank you all for your feedback during the proposal phase. We have established a cross-wiki ping channel for news on the GlobalFactSync project. We took your accounts from the endorsement and talk sections, so this is the only ping you will receive unless you sign up: further pings will be sent only to people who have signed up. Please have a look at the kick-off note and its update below. SebastianHellmann (talk) 15:32, 15 August 2019 (UTC)

User Script, Data Browser, Reference Web Service (15 August 2019)


After the kick-off note at the end of July, which described our first edit and the concept in more detail, we shaped the technical microservices and data into more concise tools that are easier to use and to demo during our Wikimania presentation:

  1. The User Script, available at User:JohannesFre/global.js, shows links from each article and from Wikidata to the Data Browser and the Reference Web Service
     
    User Script Linking to the GFS Data Browser
  2. The GFS Data Browser (GitHub) now accepts any subject URI from Wikipedia, DBpedia or Wikidata; see the Boys Don't Cry example from the kick-off note, Berlin's geo-coordinates (lat/long), and Albert Einstein's religion. It is not live yet, so edits and fixes are not reflected.
  3. The Reference Web Service (Albert Einstein: http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://en.wikipedia.org/wiki/Albert_Einstein&format=json&dbpedia) (1) extracts all references from a Wikipedia page, (2) matches them to the infobox parameters, and (3) extracts the corresponding fact as well. The service will remain stable, so you can use it.

Furthermore, we are designing a friendly fork of HarvestTemplates to effectively import all that data into Wikidata.
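
For anyone who wants to call the Reference Web Service from a script, here is a minimal Python sketch that rebuilds the request URL from the Albert Einstein example in item 3. The helper name `build_reference_url` is hypothetical, and the bare `dbpedia` flag is simply copied from the example link because its meaning is not documented on this page. The sketch only builds the URL and sends no request.

```python
from urllib.parse import urlencode

# Endpoint copied from the example link in item 3.
REFERENCE_SERVICE = "http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references"


def build_reference_url(article_url: str) -> str:
    """Return the Reference Web Service URL for one Wikipedia article."""
    # Keep ':' and '/' unescaped so the result has the same form as the example link.
    query = urlencode({"article": article_url, "format": "json"}, safe=":/")
    # The example link ends with a bare 'dbpedia' flag; it is reproduced as-is.
    return f"{REFERENCE_SERVICE}?{query}&dbpedia"


url = build_reference_url("https://en.wikipedia.org/wiki/Albert_Einstein")
print(url)
```

The resulting URL can be opened in a browser or passed to any HTTP client; the structure of the JSON response is not described here, so inspect it before relying on specific fields.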


Kick-off note (25 July 2019)



GlobalFactSync - Synchronizing Wikidata and Wikipedia's infoboxes


How is data edited in Wikipedia/Wikidata? Where does it come from? And how can we synchronize it globally?

The GlobalFactSync (GFS) Project — funded by the Wikimedia Foundation — started in June 2019 and has two goals:

  • Answer the above-mentioned three questions.
  • Build an information system to synchronize facts between all Wikipedia language-editions and Wikidata.

Now we are seven weeks into the project (10+ more months to go) and we are releasing our first prototypes to gather feedback.


How – Synchronization vs Consensus

We follow an absolute Human(s)-in-the-loop approach when we talk about synchronization. The final decision whether to synchronize a value or not should rest with a human editor who understands consensus and the implications. There will be no automatic imports. Our focus is to drastically reduce the time to research all references for individual facts.

A trivial example is the release date of the single “Boys Don’t Cry” (March 16th, 1989) in the English, Japanese, and French Wikipedia, in Wikidata, and finally in the external open database MusicBrainz. A human editor might need 15-30 minutes to find and open all the different sources, while our current prototype can spot differences and display them in 5 seconds.

We already had our first successful edit where a Wikipedia editor fixed the discrepancy with our prototype: “I’ve updated Wikidata so that all five sources are in agreement.” We are now working on the following tasks:

  • Scaling the system to all infoboxes, Wikidata and selected external databases (see below on the difficulties there)
  • Making the system:
    • “live” without stale information
    • “reliable” with fewer technical errors when extracting and indexing data
    • “better referenced” by not only synchronizing facts but also references


Contributions and Feedback

To ensure that GlobalFactSync will serve and help the Wikiverse, we encourage everyone to try our data and microservices and leave us some feedback, either on our Meta-Wiki page or via gfs@infai.org. In the following 10+ months, we intend to improve and build upon these initial results. At the same time, these microservices are available to every developer to use and to build useful applications on. The most promising contributions will be rewarded with the book “Engineering Agile Big-Data Systems”. Please post feedback or any tool or GUI here. If you need changes to be made to the API, please let us know, too. For the ambitious future developers among you, we have some budget left that we will dedicate to an internship. To apply, just mention it in your feedback post.

Finally, to talk to us and other GlobalFactSync users, you may want to visit WikidataCon and Wikimania, where we will present the latest developments and the progress of our project.


Data, APIs & Microservices (Technical prototypes)

Data Processing and Infobox Extraction:

For GlobalFactSync we use data from Wikipedia infoboxes in different languages, as well as Wikidata and DBpedia, and fuse them into one big, consolidated dataset: a PreFusion dataset (in JSON-LD). More information on the fusion process, which is the engine behind GFS, can be found in the FlexiFusion paper. One of our next steps is to integrate MusicBrainz into this process as an external dataset. We hope to integrate even more such external datasets to increase the amount of available information and references.
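
To make the idea of fusing while keeping provenance more concrete, here is a small Python sketch. It is not the FlexiFusion implementation and does not follow the actual PreFusion schema: the function `fuse`, the source labels, the property names, and the values are all placeholders, and the output is plain JSON rather than JSON-LD. It only illustrates grouping the values stated for each property together with the sources that state them.

```python
import json


def fuse(records_by_source):
    """Merge per-source records into property -> value -> list of sources."""
    fused = {}
    for source, record in records_by_source.items():
        for prop, value in record.items():
            # Remember which source stated which value for this property.
            fused.setdefault(prop, {}).setdefault(str(value), []).append(source)
    return fused


# Hypothetical input: property names and values are placeholders, not real data.
records = {
    "enwiki": {"population": "value-1", "country": "value-X"},
    "dewiki": {"population": "value-2", "country": "value-X"},
    "wikidata": {"population": "value-1"},
}

fused = fuse(records)
print(json.dumps(fused, indent=2))
```

A property with more than one distinct value (here `population`) is a candidate to show to a human editor, who then decides whether and how to synchronize it.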


First microservices:

We deployed a set of microservices to show the current state of our toolchain.

  • [Initial User Interface] The GFS Data Browser is our GlobalFactSync UI prototype (available at http://global.dbpedia.org), which shows all extracted information available for one entity from different sources. It can be used to analyze the factual consensus between different Wikipedia articles about the same thing. Example: Look at the variety of population counts for Grimma.
  • [Reference Data Download] We ran the Reference Extraction Service over 10 Wikipedia languages. Download dumps here.
  • [ID service] Last but not least, we offer the Global ID Resolution Service. It ties together all available identifiers for one thing (i.e. at the moment all DBpedia/Wikipedia and Wikidata identifiers – MusicBrainz coming soon…) and shows their stable DBpedia Global ID.
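
One common way to tie together identifiers that refer to the same thing is to cluster them over pairwise "same thing" links with a union-find structure. The Python sketch below shows that general technique only; it is an assumption for illustration, not a description of how the Global ID Resolution Service works internally, and all identifiers in it are made up.

```python
def cluster_identifiers(links):
    """Group identifiers connected by 'same thing' links (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for identifier in list(parent):
        clusters.setdefault(find(identifier), set()).add(identifier)
    return list(clusters.values())


# Hypothetical identifiers and links, for illustration only.
links = [
    ("enwiki:Example", "wikidata:Q_EXAMPLE"),
    ("dewiki:Beispiel", "wikidata:Q_EXAMPLE"),
    ("enwiki:Other", "wikidata:Q_OTHER"),
]
clusters = cluster_identifiers(links)
for cluster in clusters:
    print(sorted(cluster))
```

Each resulting cluster could then be assigned one stable identifier, which is the role the DBpedia Global ID plays in the service described above.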


Finding sync targets

To test our algorithms, we started by looking at various groups of subjects, our so-called sync targets. Across these subjects, we identified a set of problems with varying levels of complexity:

  • identity check/check for ambiguity: Are we talking about the same entity?
  • fixed vs. varying property: Some properties vary depending on nationality (e.g., release dates) or point in time (e.g., population count).
  • reference: Depending on the outcome of the entity’s identity check and on whether the property is fixed or varying, the reference might vary. Also, for some targets, no queryable online reference might be available.
  • normalization/conversion of values: Depending on the language/nationality of the article, properties can have varying units (e.g., currency, metric vs imperial system).
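
The normalization problem in the last item can be sketched as follows: convert every value to one base unit before comparing, and allow a small tolerance for rounding. The function names, the tolerance, and the sample values below are assumptions chosen for illustration; only the conversion factors (1 ft = 0.3048 m, 1 in = 0.0254 m) are exact definitions.

```python
import math

# Exact conversion factors to metres.
METRES_PER_UNIT = {"m": 1.0, "cm": 0.01, "ft": 0.3048, "in": 0.0254}


def to_metres(parts):
    """Convert a length given as (amount, unit) pairs, e.g. feet plus inches, to metres."""
    return sum(amount * METRES_PER_UNIT[unit] for amount, unit in parts)


def same_length(a, b, tolerance_m=0.01):
    """Treat two lengths as equal if they differ by at most `tolerance_m` metres."""
    return math.isclose(to_metres(a), to_metres(b), abs_tol=tolerance_m)


# Illustrative values only: 6 ft 6 in is 1.9812 m, which rounds to 198 cm.
imperial = [(6, "ft"), (6, "in")]
metric = [(198, "cm")]
print(round(to_metres(imperial), 4), same_length(imperial, metric))
```

Without the tolerance, the two infobox values would be flagged as a discrepancy even though they describe the same height after rounding.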

The check for ambiguity is the most crucial step to ensure that the infoboxes being compared do refer to the same entity. We found instances where the Wikipedia page and the infobox shown on that page presented information about different subjects (e.g., see here).


Examples

We identified the group ‘NBA players’ as a good sync target to start with. There are no ambiguity issues, it is a clearly defined group of persons, and the number of varying properties is very limited. Information seems to be derived mainly from two websites (nba.com and basketball-reference.com), and normalization is only a minor issue. ‘Video games’ also proved to be an easy sync target, with the main problem being varying properties such as different release dates for different platforms (Microsoft Windows, Linux, Mac OS X, Xbox) and different regions (NA vs EU).

More difficult topics, such as ‘cars’, ‘music albums’, and ‘music singles’, showed more potential for ambiguity as well as property variability. A major concern was Wikipedia pages that contain multiple infoboxes (often seen on pages about a certain type of car, such as this one). Reference and fact extraction can be done for each infobox, but currently we run into trouble once we fuse this data.

Further information about sync targets and their challenges can be found on our Meta-Wiki discussion page, where Wikipedians who deal with infoboxes on a regular basis can also share their insights on the matter. Some issues were also found regarding the mapping of properties. To make GlobalFactSync as widely applicable as possible, we rely on the DBpedia community to help us improve the mappings. If you are interested in participating, we will connect with you at http://mappings.dbpedia.org and in the DBpedia forum.


Bottom line – We value your feedback!