Grants talk:Project/DBpedia/GlobalFactSyncRE/Timeline/Tasks

Documentation of the current prefusion-dump/MongoDB setup

Marvin has documented the current prefusion-dump/MongoDB setup at https://git.informatik.uni-leipzig.de/gfs/main/blob/master/global.dbpedia.org.md. Tina Schmeissner (talk) 13:14, 11 June 2019 (UTC)


Sebastian Hellmann commented here: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#A_new_use_for_Wikidata_external_IDs_in_Wikipedia_(template) Tina Schmeissner (talk) 13:21, 11 June 2019 (UTC)

Challenge

We want to announce a challenge in the hope of finding a good intern to work on the project. A first draft can be found here: challenge. Any ideas for improvement are welcome. Tina Schmeissner (talk) 12:10, 14 June 2019 (UTC)

Initial version of references extraction from infoboxes

Email from Krzysztof:

We have an initial version of references extraction from infoboxes. The project URL is https://git.informatik.uni-leipzig.de/kwecel/infoboxes-refs

So far the script extracts raw references, i.e. without further parsing: it simply outputs whatever appears between <ref> and </ref>. Please note that some references are named; for those we keep only the name, so that they can be processed further during the extraction phase. This is also more convenient for a later join with another table, in which each reference could be extracted once and used in many places.

The following columns can be found in the output.

1- Wikipedia_article: name/title of the Wikipedia article

2- Infobox_name: name of the infobox; the list of infoboxes is kept in a separate directory and was prepared by analysing which templates really are infoboxes

3- Parameter_name: the raw property, in DBpedia terms; identifies a row in an infobox

4- Reference_name: name of the reference, if provided; if not, the value "<noname_ref>" is used instead; names are unique only within a given article; sometimes a reference name is defined outside of the infobox

5- Reference_direct_code: raw code, as explained above; this is the main input for further development
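The five columns above can be illustrated with a minimal Python sketch. This is not the code from the infoboxes-refs repository; the function name `extract_refs`, the regular expressions, and the choice to leave Reference_direct_code empty for self-closing named references are assumptions made for illustration only.

```python
import re

NONAME = "<noname_ref>"  # placeholder named in the column description above

# Opening <ref ...> tag; group 2 is "/" for a self-closing tag such as <ref name="x" />
OPEN_RE = re.compile(r"<ref\b([^>]*?)(/?)>")
# Reference name inside the tag attributes (quoted or unquoted)
NAME_RE = re.compile(r'name\s*=\s*"?([^">/]+)"?')


def extract_refs(article, infobox, parameter, value):
    """Return one row (dict) per reference found in a raw infobox parameter value."""
    rows = []
    pos = 0
    while True:
        match = OPEN_RE.search(value, pos)
        if match is None:
            break
        attrs, self_closing = match.group(1), match.group(2)
        name_match = NAME_RE.search(attrs)
        name = name_match.group(1).strip() if name_match else NONAME
        if self_closing:
            # Named reference defined elsewhere: only the name is available here.
            code, pos = "", match.end()
        else:
            end = value.find("</ref>", match.end())
            if end == -1:
                # Unterminated reference: take the rest of the value.
                code, pos = value[match.end():], len(value)
            else:
                code, pos = value[match.end():end], end + len("</ref>")
        rows.append({
            "Wikipedia_article": article,
            "Infobox_name": infobox,
            "Parameter_name": parameter,
            "Reference_name": name,
            "Reference_direct_code": code,
        })
    return rows
```

Each returned dict corresponds to one output row, so the result can be written out with `csv.DictWriter` or joined later on Wikipedia_article and Reference_name.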


Włodek will upload the code. There are also some examples in the output folder, ca. 10,000 rows for selected languages. We can upload the samples directly to GitLab, just for an overview. For full dumps we need to discuss the destination: where should the data be uploaded?

Tina Schmeissner (talk) 09:11, 17 June 2019 (UTC)

Factual Consensus Finder - UI

I understand what the FCF does, but there are still a bunch of questions:

1. How or where do I enter the subject / entity that the infobox belongs to on the page? Do I always need the DBpedia identifier?

2. How will the user be able to reach this page from a Wikipedia page? I assume the ideal scenario would be a link to the FCF page somewhere in the infoboxes.

3. Using DBpedia as an example:

predicate | # of values and sources | questions | feedback from Marvin
description | 1 | Result is “semantic web” for German wiki, but this is not shown anywhere in the infobox of the German wiki. | “semantic web” is listed in the IB with the predicate "Beschreibung", but not shown in the actual IB
latest release version | 5 | First value is empty, with 4 wikis as sources. | There are empty but valid triples being extracted
developer | 5 | Why are the universities listed in all these languages (why not just in the language of the respective wiki?), and why are they linked to their respective FCF pages? | not yet discussed

Tina Schmeissner (talk) 12:07, 18 June 2019 (UTC)

Two docs about fixing mappings

See also https://docs.google.com/document/d/1yZLNKZ802pC-U0PYMqnyem9KZn5qADccXR2Te2wlr6Q/edit and https://svn.aksw.org/papers/2018/SAC_DBpedia_mappings_alignment/public.pdf Sent by Dimitris, 13:00, 8 July 2019 (UTC)


MusicBrainz - SameAs Problem

Found this paper: Automatic Interlinking of Music Datasets on the Semantic Web ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-369/paper18.pdf SebastianHellmann (talk) 08:16, 9 July 2019 (UTC)


DBpedia extractor + Infobox references extractor

Example of extracting references from the article about Facebook in English Wikipedia:


DBpedia extraction framework on this page:

  1. there is no parameter "rww źródło" in the article about Aceton in Polish Wikipedia: http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/pl/extract?title=Aceton&revid=&format=json&extractors=custom


Updates

  1. parameter 'spouse' has two values (names), and each value has additional data (dates)
  2. parameter 'award' is not parsed correctly (the values are a list inside the 'Plainlist' template), so the values and the reference 'frs' are not found.

--Lewoniewski (talk) 09:06, 12 August 2019 (UTC)


Upgraded version of Python Infobox Reference Extractor (PIRE):


Extraction statistics in September 2019:

citation_id

In both versions of the parser, a special 'citation_id' parameter is generated for each citation template, based on the value of one of the following citation template parameters:

The order is important: the parser generates the ID from whichever parameter is found first. If none of these parameters is present, the parser generates an ID of the form 'http://citation.dbpedia.org/hash2/...', hashing the 'title' parameter or, if that is empty, the content of the citation template. --Lewoniewski (talk) 09:05, 12 August 2019 (UTC)
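The priority-then-hash logic described above can be sketched as follows. This is only an illustration, not the parser's code: the parameter list `ID_PARAMETERS`, the `name:value` format for non-hash IDs, and the use of SHA-1 are all assumptions, since the original parameter list and hash function are not given here.

```python
import hashlib

# Hypothetical priority list; the parser's real list is not reproduced on this page.
ID_PARAMETERS = ["doi", "isbn", "url"]
HASH_PREFIX = "http://citation.dbpedia.org/hash2/"


def citation_id(params, template_content):
    """Build an ID from the first non-empty priority parameter, else from a hash."""
    for name in ID_PARAMETERS:  # order matters: the first parameter found wins
        value = params.get(name, "").strip()
        if value:
            return f"{name}:{value}"  # hypothetical ID format
    # Fallback: hash the title or, if it is empty, the whole template content.
    basis = params.get("title", "").strip() or template_content
    digest = hashlib.sha1(basis.encode("utf-8")).hexdigest()  # hash choice is an assumption
    return HASH_PREFIX + digest
```

With this shape, two citations that share a title but have no identifying parameter get the same ID, which is the behaviour the fallback rule implies.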

References names/metadata

  1. https://fr.wikipedia.org/wiki/Mod%C3%A8le:Bioref
  2. https://pl.wikipedia.org/wiki/Szablon:FP9
  3. many others, for example: https://en.wikipedia.org/wiki/Category:Chemistry_citation_templates or https://en.wikipedia.org/wiki/Category:Specific-source_templates

Error handling in wikicode

  • A template is missing its closing pair of brackets in the infobox about Warsaw in Polish Wikipedia (this revision):
 |rok                       = 
 |liczba ludności           = 1 777 972 (31.12.2018)</small><ref name="GUS 2018">{{Cytuj stronę |url = http://demografia.stat.gov.pl/bazademografia/Tables.aspx</ref>
 |gęstość zaludnienia       = 3412 <small>(1.01.2018)</small><ref name="GUS 2018" />
  • There is no "=" between the name and the value of a parameter. Example: the wiceprezydent parameter in this revision:
|pierwsza dama = [[Margarita Penón]]
 |wiceprezydent<br />1. [[Jorge Manuel Dengo Obregón]] (1986-1990)<br />2. [[Victoria Garrón Orozco]](1986-1990)<br />1. [[Laura Chinchilla]] (2006-2010)<br />2. [[Kevin Casas Zamora]] (2006-2010)
 | quote =
'''R5 (silnik)|R5'''
    • Pay attention to (in code with comment PPnPP):
      • length of parameter name of the infobox.
      • length of parameter value and number of the references.
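The first two error cases listed above can be detected with simple checks. The following is a rough sketch under stated assumptions, not the extractor's own error handling: both function names are hypothetical, and the "=" check is naive (a line whose value happens to contain "=" would not be flagged).

```python
import re

# Body of a non-self-closing <ref ...>...</ref>; <ref ... /> is deliberately skipped.
REF_BODY_RE = re.compile(r"<ref(?:\s[^>]*)?(?<!/)>(.*?)</ref>", re.DOTALL)


def unbalanced_refs(wikicode):
    """Return reference bodies whose '{{' and '}}' counts differ."""
    return [
        body
        for body in REF_BODY_RE.findall(wikicode)
        if body.count("{{") != body.count("}}")
    ]


def parameters_without_equals(infobox_lines):
    """Return infobox lines that start with '|' but contain no '='."""
    return [
        line.strip()
        for line in infobox_lines
        if line.lstrip().startswith("|") and "=" not in line
    ]
```

Rows flagged by either check could be logged and skipped, or passed to a more tolerant fallback parser, depending on how much data loss is acceptable.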

URL extraction from references

Wikipedia infoboxes

Here are statistics on the extraction of reference URLs from infoboxes in different Wikipedia language versions (based on dumps from September 2019):

Files with "_domains" show how frequently each domain is used in the references; "all_domains.txt" sums the results from all considered language versions of Wikipedia.
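A domain-frequency count of this kind could be computed roughly as below. This is a sketch, not the script that produced the files: the function names and the decision to count by full host name (rather than by registered domain) are assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_frequencies(urls):
    """Count how often each host name occurs in a list of reference URLs."""
    counts = Counter()
    for url in urls:
        try:
            host = urlsplit(url).hostname  # already lower-cased by urllib
        except ValueError:
            continue  # malformed URL, e.g. a broken IPv6 literal
        if host:
            counts[host] += 1
    return counts


def sum_over_languages(per_language):
    """Sum per-language counters into one total, as in an 'all domains' file."""
    total = Counter()
    for counts in per_language.values():
        total.update(counts)
    return total
```

`Counter.most_common()` then yields the domains in descending order of frequency, which is a natural layout for such a file.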

Wikidata

Similar statistics for Wikidata (based on dumps from October 2019):

In files with "_unique", each URL in the references was counted only once per Wikidata item.
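The difference between the plain and the "_unique" counts can be shown with a short sketch. The function name `count_urls` and the input shape (a mapping from item ID to the list of reference URLs found on that item) are assumptions for illustration.

```python
from collections import Counter


def count_urls(item_refs, unique=False):
    """Count reference URLs across items; with unique=True, once per item at most."""
    counts = Counter()
    for urls in item_refs.values():
        # A set drops repeated uses of the same URL within one item.
        counts.update(set(urls) if unique else urls)
    return counts
```

For an item that cites the same URL on several statements, the plain count grows with every use, while the unique count grows by one.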
