Wikimania05/Paper-JV2


Metadata with Personendaten and beyond

  • Author(s): Jakob Voss
  • License: GFDL and CC-by-sa

Around 20% of all articles in the German Wikipedia are about people. Style guides have since emerged that recommend starting biographical articles with the name, the date and place of birth and death, and a short description. But this information is readable only by humans, and there are too many exceptions for automatic processing. To allow searching for people by year and place of birth, additional categories were added, with disputed success. So a template with a set of defined data fields called Personendaten (see http://de.wikipedia.org/wiki/Wikipedia:Personendaten, in German) was created. This paper explains the history of this metadata, the experience gained with it, and future possibilities.


Motivation


The motivation for introducing Personendaten came from the experience of the publisher Directmedia Publishing when we created the first CD edition of the German Wikipedia (Directmedia, 2004; see also http://de.wikipedia.org/wiki/Wikipedia:Wikipedia-CD).

For that edition it was necessary to re-order the names of persons from natural order (first name, then family name) to the order required by the Rules for Alphabetical Cataloguing (Regeln für die alphabetische Katalogisierung, RAK) [1] and to display birth and death dates separately. It was therefore proposed to create a way to extract the correct name components, a short description, and the date and place of birth and death for the index of persons directly from the Wikipedia articles (see [2] and [3], both in German). For this purpose the "Vorlage Personendaten" ("Template person data", [4]) was created in October 2004 with the following data fields:

  • NAME (official name of the person)
  • ALTERNATIVNAMEN (additional names)
  • KURZBESCHREIBUNG (short description, for instance profession and nationality)
  • GEBURTSDATUM (date of birth)
  • GEBURTSORT (place of birth)
  • STERBEDATUM (date of death)
  • STERBEORT (place of death)

The template is placed at the end of a Wikipedia article, between the categories and the interwiki links. By default it is invisible to normal users. Logged-in users can make it visible by adding a CSS statement to their Monobook.css: table.metadata { display:block; display:table; }. Below is an example of a Personendaten record:

{{Personendaten|
 NAME=Ringelnatz, Joachim
|ALTERNATIVNAMEN=Bötticher, Hans (Geburtsname)
|KURZBESCHREIBUNG=deutscher Schriftsteller und Maler
|GEBURTSDATUM=7. August 1883
|GEBURTSORT=[[Wurzen]] bei Leipzig
|STERBEDATUM=17. November 1934
|STERBEORT=[[Berlin]]
}}
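
Because the record is an ordinary template call, its fields can be read back out of the article source with a few lines of code. The following Python sketch is only an illustration, not the tooling used for the CD edition: the function name parse_personendaten is hypothetical, and it assumes one field per line (as in the record above) and no nested templates inside the field values.

```python
import re

# Field names as listed for the Personendaten template above.
FIELDS = (
    "NAME", "ALTERNATIVNAMEN", "KURZBESCHREIBUNG",
    "GEBURTSDATUM", "GEBURTSORT", "STERBEDATUM", "STERBEORT",
)

# Assumption: field values contain no nested "{{...}}" templates,
# so the first "}}" closes the Personendaten call.
TEMPLATE_RE = re.compile(r"\{\{Personendaten\s*\|(.*?)\}\}", re.DOTALL)

# Assumption: one field per line, optionally introduced by "|".
FIELD_RE = re.compile(
    r"^\s*\|?\s*(" + "|".join(FIELDS) + r")\s*=(.*)$", re.MULTILINE
)


def parse_personendaten(wikitext):
    """Return the fields of the first Personendaten call as a dict, or None."""
    match = TEMPLATE_RE.search(wikitext)
    if match is None:
        return None
    return {name: value.strip() for name, value in FIELD_RE.findall(match.group(1))}


EXAMPLE = """{{Personendaten|
 NAME=Ringelnatz, Joachim
|ALTERNATIVNAMEN=Bötticher, Hans (Geburtsname)
|KURZBESCHREIBUNG=deutscher Schriftsteller und Maler
|GEBURTSDATUM=7. August 1883
|GEBURTSORT=[[Wurzen]] bei Leipzig
|STERBEDATUM=17. November 1934
|STERBEORT=[[Berlin]]
}}"""

record = parse_personendaten(EXAMPLE)
print(record["NAME"], "|", record["GEBURTSDATUM"], "|", record["GEBURTSORT"])
```

Wiki links such as [[Wurzen]] are deliberately left untouched in the values, so a later step can decide whether to resolve or strip them.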

Implementation


The speedy implementation of around 30,000 Personendaten records in January 2005 was made possible in particular by the technical support of Christian Thiele (APPER in the German Wikipedia) and a community party organised by Directmedia Publishing.

From the introductory sentences of biographical articles, APPER could generate suggestions for Personendaten with a specially developed program (see image below and http://www.apper.de/pd/). These suggestions still had to be checked, corrected and entered into the articles. Because the date of the "snapshot" for the DVD edition was approaching, Directmedia Publishing, the publisher of the DVD, decided to accelerate the entry of person data. From January 28 to 30, 2005, they organized a "tagging party" where Wikipedians could enter person data around the clock in a social atmosphere ([5]). The event ultimately helped to tag 30,000 articles by the end of January 2005.


A screenshot of APPER's tool for Personendaten (http://www.apper.de/pd/)

After a short note about the results of the joint effort (see [6]), in which people remarked that it would be useful to use authority files, representatives of the Deutsche Bibliothek (DDB) got in touch with us. The Deutsche Bibliothek, which has the status of a national library in Germany, maintains among other things the Personennamendatei (PND, authority records for personal names), which has library datasets for more than 600,000 persons and contains more than 2 million names. Every PND record has a unique number, which for instance helps to keep apart persons with identical names. We quickly agreed to meet and get to know each other.

At the Bibliothekartag (annual German library conference) we arranged a cooperation to enrich articles in the German Wikipedia with PND numbers and links into DDB's catalog. By this means Wikipedia can profit from controlled links to literature by and about specific people, and the PND can profit from additions and corrections made in Wikipedia. A simple template includes PND numbers in the Weblinks section of articles. For instance, {{PND|118601121}} in the article about Joachim Ringelnatz produces a link titled "Literatur von und über Joachim Ringelnatz" ("Literature by and about Joachim Ringelnatz").

DDB provided a way to search for PND numbers by name, and Christian Thiele wrote another user-friendly script. With his script (see http://www.apper.de/wikipedia/pnd) you can fetch PND data from DDB, align it with Personendaten and add the number to the corresponding Wikipedia article in one step. Afterwards, biographical Wikipedia articles and their PND numbers can be extracted from the German Wikipedia SQL dump. Once the script had been introduced on the project's mailing list, many Wikipedians soon started to add PND numbers. After a single week around 10,000 PND numbers had been processed.


APPER's tool for PND numbers (http://www.apper.de/wikipedia/pnd/).
Here multiple PND numbers are found and the user has to select the corresponding one for a Wikipedia article

At the moment PND numbers only link to the OPAC of Die Deutsche Bibliothek. There you find many books by or about specific people, but mostly in German. To include more literature, more libraries and catalogues can be added. For instance Kalliope, a database of autographs at the Staatsbibliothek zu Berlin, also uses PND numbers. Even more information can be connected by synchronizing with additional authority files and bibliographic databases such as the Library of Congress Name Authority Files (LCNAF), the Union List of Artist Names (ULAN), the Allgemeine und Neue Deutsche Biographie (ADB/NDB), the Allgemeines Künstlerlexikon (AKL) and, probably the largest one, the World Biographical Information System (WBIS) with more than 5 million records.

Analysis


Personendaten and PND numbers are included in Wikipedia articles using templates, but the MediaWiki software has not been adapted for them. For detailed analysis you have to extract their data from Wikipedia's database. Some Perl scripts were used to extract Personendaten, normalize and transform them, and save them into a data warehouse. In data warehousing, data is split up into the smallest entities you want to analyse and is saved in multiple resolutions. Because of this denormalization you need more space, but requests can be answered more efficiently. With the Personendaten prepared in this way you can also find missing or irregular information.
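
To make the idea of saving a value "in multiple resolutions" concrete, here is a minimal Python sketch. It is not the Perl code mentioned above; the function names, the chosen resolutions (day, month, year, decade, century) and the accepted date formats are assumptions made for illustration. Anything that does not look like "7. August 1883", "August 1883" or "1883" is returned as None, so that it can be reported as irregular.

```python
import re

GERMAN_MONTHS = {
    "Januar": 1, "Februar": 2, "März": 3, "April": 4, "Mai": 5, "Juni": 6,
    "Juli": 7, "August": 8, "September": 9, "Oktober": 10,
    "November": 11, "Dezember": 12,
}

# Optional day ("7."), optional month name, then a year of up to four digits.
DATE_RE = re.compile(r"^(?:(\d{1,2})\.\s*)?(?:([^\W\d_]+)\s+)?(\d{1,4})$")


def normalize_date(text):
    """Split a German date into several resolutions; None if it is irregular."""
    match = DATE_RE.match(text.strip())
    if match is None:
        return None
    day, month_name, year = match.groups()
    if month_name is not None and month_name not in GERMAN_MONTHS:
        return None
    if day is not None and (month_name is None or not 1 <= int(day) <= 31):
        return None
    year = int(year)
    return {
        "year": year,
        "month": GERMAN_MONTHS.get(month_name),
        "day": int(day) if day is not None else None,
        "decade": year // 10 * 10,
        "century": (year - 1) // 100 + 1,
    }


def age_at_death(birth, death):
    """Completed years of life; None unless both dates are known to the day."""
    if not birth or not death or birth["day"] is None or death["day"] is None:
        return None
    had_birthday = (death["month"], death["day"]) >= (birth["month"], birth["day"])
    return death["year"] - birth["year"] - (0 if had_birthday else 1)


born = normalize_date("7. August 1883")
died = normalize_date("17. November 1934")
print(born["decade"], born["century"], age_at_death(born, died))
```

Storing the decade and century next to the year is redundant, but it is exactly this redundancy that lets a query such as "all people born in the 19th century" run without recomputing anything per record.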

As an example of the manifold possibilities of analysis, here is the distribution of ages of the people covered in German Wikipedia articles and the average lifetime of people who have already died (created from the SQL dump of 2005-06-23):


Examples of analysis created with Personendaten

Additionally, a biographical search engine can be implemented using the database of Personendaten. A first prototype is available at http://wdw.sieheauch.de/people_today.php for demonstration. For example, you can list all people who died or were born on a specific day or in a specific year.

A first survey with a sample of 500 articles showed that around 42% (±4 at the 95% confidence level) of all biographical articles with Personendaten can easily be mapped to PND records. In 19% of all cases there are matching names with differing years of birth and/or death (although in some cases the years do not differ much, or only the year of death is missing). For 39% (±4) no PND number was found. The fraction of articles that can be mapped can probably be increased by manual work.
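
The stated margins can be checked with the usual normal approximation for a sample proportion. Whether this is the formula that was actually used for the figures above is an assumption; the sketch only shows that a sample of 500 is consistent with a margin of roughly ±4 percentage points.

```python
import math


def margin_of_error(share, sample_size, z=1.96):
    """Half-width of the normal-approximation interval (z=1.96 for 95%)."""
    return z * math.sqrt(share * (1 - share) / sample_size)


# Shares reported above for the sample of 500 articles.
for share in (0.42, 0.39):
    moe = margin_of_error(share, 500)
    print(f"{share:.0%}: +/- {moe * 100:.1f} percentage points")
```

Both margins come out slightly above 4 percentage points, which matches the rounded ±4 given in the text.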

Semantics


Personendaten and PND numbers are only a first step towards enriching and linking Wikipedia articles with semantic information and other data providers. Semantic extensions can be specified both explicitly (mostly via templates such as Personendaten and geographical coordinates) and implicitly in the normal content of articles. The most explicit way is to use a database with fixed fields instead of free text. Erik Möller (2004) proposed such a system (Wikidata), but collaboratively editing data fields is presumably different from collaboratively editing text.

Daniel Kinzler (2005) tries to statistically extract semantic relations directly from the link structure. Collocation and cluster analysis help to create a network of topics and their relations out of Wikipedia articles. Bellomi and Bonato (2005) use a similar approach. They apply algorithms of network analysis, especially HITS by Kleinberg (1999), to find authorities among frequently linked articles. These authorities are mostly concepts used to structure space and time, such as country names, city names and other geopolitical entities. Natalia Kozlova (2005) tries to extract a whole ontology out of Wikipedia by extracting links and concept types, while structural elements of wiki articles are also considered. But links between Wikipedia articles do not carry a specified meaning, so it is fairly complex to extract precise relations.

Krötzsch, Vrandečić and Völkel (2005) propose to add optional types to links between Wikipedia articles. For instance, in the article about Ringo Starr you could write that he is best known as [[drummer||is-a]] for [[The Beatles||part-of]] to indicate relationships between the person Ringo Starr, the job of a drummer and the music group The Beatles. Creating link types would be similar to creating and editing Wikipedia categories. In David Aumueller's approach (2005) relationships are also modeled as wiki pages.

Typed links or relationships and addressable entities are basics of the Semantic Web as proposed by Tim Berners-Lee (1998). Like the vision of the Semantic Web as a whole, they seem simple in small examples but have never been proven to work at a larger scale in heterogeneous environments. Experience with Dublin Core metadata shows that technology does not remove the need to agree on the exact meaning of relations (see Tennant, 2004 for a disenchanting view). Before trying to add new semantics you should keep in mind that it is already complex to define the existing structures of Wikipedia as semantic relations (see Harth et al., 2005). But maybe the emerging phenomenon of social tagging can help. Incidentally, categories in Wikipedia also tend to be a fuzzy kind of tagging rather than a strict system of logical relationships. In any case, Wikipedia promises to be a playground for semantic relations.
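
To illustrate what typed links would give a program to work with, the Ringo Starr example above can be turned into subject-relation-object triples. This is a sketch only: the [[target||type]] notation is taken from the example in this paper and is not a deployed MediaWiki syntax, and the function name extract_relations is hypothetical.

```python
import re

# Matches the typed-link notation of the example above: [[target||type]].
# Ordinary links such as [[Berlin]] or [[a|b]] are ignored on purpose.
TYPED_LINK_RE = re.compile(r"\[\[([^\[\]|]+)\|\|([^\[\]|]+)\]\]")


def extract_relations(subject, wikitext):
    """Return (subject, relation, target) triples for every typed link."""
    return [
        (subject, relation.strip(), target.strip())
        for target, relation in TYPED_LINK_RE.findall(wikitext)
    ]


text = "He is best known as [[drummer||is-a]] for [[The Beatles||part-of]]."
for triple in extract_relations("Ringo Starr", text):
    print(triple)
```

The extraction itself is trivial; the hard part, as the paragraph above argues, is agreeing on what a relation such as is-a or part-of is supposed to mean across thousands of articles.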

One solution for unique and well-defined entities is authority control. Libraries have built up the world's largest collections of controlled vocabularies of different kinds, such as the Personennamendatei and the Library of Congress Name Authority File (both will be combined in the VIAF project), the Getty Thesaurus of Geographic Names and the Dewey Decimal Classification. Connecting Wikipedia with these established systems of subject indexing both grounds Wikipedia articles in semantic systems and lets other sources link directly to and include Wikipedia content. Collaboration with smaller databases, for instance MusicBrainz, is also promising (see also the list of links to databases at http://de.wikipedia.org/wiki/Wikipedia:Datenbanklinks).

Because semantics is a complex issue, there will not be a single simple solution but a mixture of concepts. By combining all the semantic information that is present in Wikipedia you can already answer questions like "are there any famous 19th century scientists from this town?" (see image).

Combining various semantic information in Wikipedia

The indisputable advantage of Personendaten and similar concepts over academic proposals is that they have already been implemented and are being used inside Wikipedia (something that would never have been achieved without APPER's tools). They can be used without modifying the existing software. They can help in searching and sorting biographical articles as well as in detecting errors in articles. With more than 40,000 indexed biographical articles, Wikipedia is becoming an authority in its own right. Connecting its structured metadata to other authority files or Semantic Web applications will advance the use of free knowledge, and that is the goal of Wikipedia.

References

  • Aumueller, David (2005): SHAWN: Structure Helps a Wiki Navigate. In: Proceedings of the BTW-Workshop "WebDB Meets IR". http://dbs.uni-leipzig.de/~david/2005/aumueller05shawn.pdf
  • Bellomi, Francesco and Bonato, Roberto (2005): Network Analysis for Wikipedia. In: Proceedings of Wikimania 2005 [7], August 2005
  • Berners-Lee, Tim (1998): Semantic Web Road map. September 1998 http://www.w3.org/DesignIssues/Semantic.html
  • Directmedia (2004). Wikipedia. Ausgabe Herbst 2004. Directmedia Publishing, October 2004, ISBN 3-89853-019-1
  • Harth, Andreas et al. (2005): WikiOnt: An Ontology for Describing and Exchanging Wiki Articles. In: Proceedings of Wikimania 2005 [8], August 2005
  • Kinzler, Daniel (2005): WikiSense - Mining the Wiki. In: Proceedings of Wikimania 2005 [10]
  • Kleinberg, Jon (1999): Authoritative sources in a hyperlinked environment. In: Journal of the ACM, volume 46, number 5, pages 604-632
  • Kozlova, Natalia (2005): Ontology Extraction for XML Classification. Universität des Saarlandes (Master Thesis)
  • Krötzsch, Markus; Vrandečić, Denny and Völkel, Max (2005): Wikipedia and the Semantic Web – The Missing Links. In: Proceedings of Wikimania 2005 [9], August 2005
  • Möller, Erik (2004): Die heimliche Medienrevolution. Heise, 2004, ISBN 3-936931-16-X
  • Tennant, Roy (2004): Metadata's Bitter Harvest. In: Library Journal, volume 129, number 12, page 32