This page is part of the Proceedings of Wikimania 2005, Frankfurt, Germany.

WikiSense - Mining the Wiki

Author: Daniel Kinzler
License: GFDL

Abstract: PDF / OpenOffice
Paper (GFDL): PDF / OpenOffice
Presentation: PDF / OpenOffice
Wikimania talk: OGG audio / OGG video

Abstract

I would like to present a project that aims to apply techniques of data-mining and knowledge-management to the Wikipedia corpus. The idea is to extract semantic relations directly from the link structure, as opposed to trying to analyze natural language. Wikipedia is an excellent basis for such an analysis because every node in the web of links represents exactly one topic. The results may be used to benefit the Wikipedias and other Wikimedia projects. Key points are support of multilingual features and computer aided structuring.

Goals

From the analysis I hope to create a network of topics and their relations, which could be seen as a semantically rich dictionary or basic ontology. This would include relations on the lexical level (synonyms, homonyms, inflections, translations) as well as on the semantic level (is-a, element-of, component-of, opposite-of).

Techniques

The first step is a broad classification of pages (disambiguation, redirect, navigation/list, real topic page, etc). After that, links to other pages are analyzed, using collocation and cluster analysis. Interlanguage links will provide a useful basis for building a translation dictionary. Categorization will be looked at separately: some categories provide information about a specific facet of a topic, such as is-a, geographic location, timespan, etc. Also, categorization has to be handled as a transitive relation.

Additional information can be derived from simple pattern matching on the pages. Boilerplate elements like the townbox are especially helpful for that.

Uses

The data gained from this analysis may be used to enrich existing semantic dictionaries like WordNet and Wortschatz. In combination with ontologies like OpenCyc it may be used for automated text analysis, reasoning and translation.

This data may also be used to automatically generate or propose entries for the Ultimate Wiktionary and help to improve structural features like categorization, especially with respect to multilingual projects like the Wikimedia Commons.

Links

Using Ultimate Wiktionary for Commons - Wikimania Presentation (GerardM)
WikiData (Erik Möller) - MetaData (Jakob Voss)
SemanticWeb (Markus Krötzsch et al.)

Wikimedia Research: meta:Research
Wikimedia Wikipedistik: de:Wikipedia:Wikipedistik and de:Wikipedia:Wikipedistik/Bibliographie

Francesco Bellomi http://www.fran.it/blog/
Rudi Cilibrasi http://www.arxiv.org/abs/cs.CL/0412098 and http://www.newscientist.com/article.ns?id=dn6924

Paper (Draft)

Overview

I would like to present a project to automatically extract topics and their semantic relations from the structure of Wikipedia and other projects. The idea is to apply techniques of knowledge management and data mining to the data in the Wikipedias, focusing on the link structure as opposed to the analysis of natural language. Wikipedia is a very good dataset for this kind of analysis, because (nearly) every page describes exactly one topic. From this analysis I hope to create a network (graph) of topics and terms and their relations. This could be used to create semantically enriched dictionaries or ontologies for knowledge representation, machine reasoning and automated translation. The results may also be used to benefit the Wikipedias and other Wikimedia projects. Key points are support of multilingual features and computer-aided structuring.

Goals

The goal is to create software that can analyse a MediaWiki database (or live site) and build a database that contains topics, terms (words) and semantic relations between them. All relations have a type and a confidence level.

The most important relations are:

Synonyms (redirects) and inflections (grammatical forms)
Homonyms (disambiguation)
Translations (interwiki)
Hypernyms (generalization)
Classification (is-a / instance-of)

Some more advanced relations are:

Elements (member of)
Components (part of)
Antonyms (opposite of)
Association of a time or timespan
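A typed relation with a confidence level could be stored along the following lines. This is only a minimal sketch: the class names, the enum members and the sample redirect are illustrative assumptions, not part of the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    # Members mirror the relations listed above; the enum itself is an assumption.
    SYNONYM = "synonym"
    INFLECTION = "inflection"
    HOMONYM = "homonym"
    TRANSLATION = "translation"
    GENERALIZATION = "generalization"
    IS_A = "is-a"
    MEMBER_OF = "member-of"
    PART_OF = "part-of"
    ANTONYM = "antonym"
    TIME = "time"


@dataclass(frozen=True)
class Relation:
    source: str
    target: str
    type: RelationType
    confidence: float  # 0.0 (pure guess) to 1.0 (certain)

    def __post_init__(self):
        # Reject confidence values outside the unit interval.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


# Hypothetical example: a redirect page yields a high-confidence synonym.
redirect = Relation("NYC", "New York City", RelationType.SYNONYM, 0.95)
```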

Additionally, it may be possible to extract properties of things from the articles, using pattern matching. Some of the properties easily extracted from the existing data include:

For people: Date and place of birth and death
For places: geographic coordinates, geographic and political association

Techniques

Classification of pages

Pages can be classified relatively easily by looking at templates, categorization and some standard text. The main classes of pages are:

Topic pages (real encyclopedia articles)
Disambiguations (homonym sets)
Redirects (synonym definitions)
Lists and other navigation pages (this needs some heuristics)
Portals (topic overviews, often also used for maintenance)

Also, bad pages could be excluded from the analysis, for example:

Dead end pages
Deletion candidates
Very short pages (maybe)
Disputed pages (maybe)
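The broad page classification could be sketched as follows. The template names, the `Portal:` and `List of` title prefixes and the link-density threshold are assumptions chosen for illustration; each wiki would need its own rules.

```python
import re

# Hypothetical marker templates; real wikis use their own template names.
DISAMBIG_TEMPLATES = {"disambig", "disambiguation"}
# Captures the template name in {{name}} or {{name|...}}.
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")
LINK_RE = re.compile(r"\[\[[^\]]+\]\]")


def classify_page(title: str, wikitext: str) -> str:
    """Return one of: redirect, disambiguation, portal, list, topic."""
    text = wikitext.strip()
    if text.upper().startswith("#REDIRECT"):
        return "redirect"
    templates = {name.lower() for name in TEMPLATE_RE.findall(text)}
    if templates & DISAMBIG_TEMPLATES:
        return "disambiguation"
    if title.startswith("Portal:"):
        return "portal"
    # Heuristic (an assumption): a page that is mostly links is a list.
    words = len(text.split())
    links = len(LINK_RE.findall(text))
    if title.startswith("List of") or (words > 0 and links / words > 0.5):
        return "list"
    return "topic"
```

Excluding bad pages (dead ends, deletion candidates) could be handled the same way, by checking for the corresponding maintenance templates before a page enters the analysis.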

Classification and Evaluation of Links

Most links can be classified by syntax or namespace. For this analysis, external links and links to pages in other namespaces can be ignored. There remain:

Normal Links to other pages

Links to other pages are the main way of building the semantic map. They are taken to indicate a connection to a related topic, although the exact nature of the relation is unknown: they are not interpreted as semantic relations, but as syntactic associations (collocations) from which semantic relations may be inferred.

Note that while semantic relations have a type and a confidence level, syntactic associations have no type but may have a weight. Links can be weighted by their position in the text: links in the first sentence or paragraph are more important, as are links in the „see also“ section and links that are bold.
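Position-based weighting might look like the sketch below. The bonus factors are arbitrary illustrative values, the „see also“ heading is assumed to be written as `== See also ==`, and bold links are not handled.

```python
import re

# Captures the link target, ignoring any section anchor or link text.
LINK_RE = re.compile(r"\[\[([^\]|#]+)[^\]]*\]\]")
SEE_ALSO_RE = re.compile(r"==\s*see also\s*==", re.IGNORECASE)


def weighted_links(wikitext, first_paragraph_bonus=2.0, see_also_bonus=1.5):
    """Return {target: weight}; the bonus factors are illustrative guesses."""
    text = wikitext.strip()
    first_paragraph_end = text.find("\n\n")
    if first_paragraph_end == -1:
        first_paragraph_end = len(text)
    see_also = SEE_ALSO_RE.search(text)
    weights = {}
    for match in LINK_RE.finditer(text):
        target = match.group(1).strip()
        if ":" in target:
            continue  # skip other namespaces and interlanguage links
        weight = 1.0
        if match.start() < first_paragraph_end:
            weight = first_paragraph_bonus
        elif see_also is not None and match.start() > see_also.start():
            weight = see_also_bonus
        # A target linked several times keeps its highest weight.
        weights[target] = max(weights.get(target, 0.0), weight)
    return weights
```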

In addition to the pure structure, the link text also gives us information we can use: in most cases it is a synonym, hyponym (sub-topic) or inflection of the name of the topic it points to.

Interlanguage Links

Interlanguage links are very important because they let us extract translations of a term into different languages. We have to keep in mind, however, that the Wikipedias have different levels of granularity, so a link may point to a more general topic; that is, the link may give us the translation of a hypernym (generalization) of the local topic.
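Collecting translation candidates from interlanguage links could be as simple as the following sketch. It assumes that language prefixes are two or three lowercase letters, which is a simplification. Because of the granularity issue described above, each result should enter the database with a confidence below 1.

```python
import re

# Matches links such as [[de:Hamburg]]; the 2-3 letter code is an assumption.
INTERLANG_RE = re.compile(r"\[\[([a-z]{2,3}):([^\]|]+)\]\]")


def translations(wikitext):
    """Return {language code: title in that language's wiki}."""
    return {lang: title.strip() for lang, title in INTERLANG_RE.findall(wikitext)}
```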

Categorizations

Categorizations of pages (topics) may be used to classify a page, but usually they also give us a valuable hint about semantic relations. Basically, a categorization represents the relation „is-sub-topic“, which may then be narrowed down by more specific information.

Also, categories are topics themselves, and usually more central ones. The information given by the category/subcategory structure can help to build semantic relations efficiently. (I have also proposed to drop the distinction between pages and categories.) This, however, requires us to classify categories by the way their members relate to them. This is similar to the „facet classification“ scheme discussed when categories were first introduced. Most importantly, the following facets should be identified:

Fields of research (like maths, politics, etc)
Space attribution (geographic categories)
Time attribution (time categories)
Classification categories (is-a relations)

The current practice, however, is to use categories with mixed semantics, which makes them difficult to use for semantic analysis. For example, geographic categories like [[Category:Germany]] include not only places in Germany, but also German people, German food, German history, etc. (accordingly, this category should really be called „German“). This can be expressed by assigning multiple semantic relations to a category, each with a low confidence level.

Other categories are quite clear with regard to their semantics: [[Category:german people]] contains only people, so all topics in that category can get an is-a relation to the person topic with high confidence.
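The idea of attaching several relations with different confidence levels to a category can be sketched as a lookup table. The table entries, relation names and confidence values here are invented purely for illustration.

```python
# Hypothetical hand-made table: category -> [(relation, target, confidence)].
CATEGORY_SEMANTICS = {
    # Mixed semantics: several weak guesses.
    "Germany": [("is-in", "Germany", 0.4), ("is-from", "Germany", 0.3)],
    # Clear semantics: one strong is-a relation.
    "German people": [("is-a", "person", 0.9), ("is-from", "Germany", 0.8)],
}


def relations_from_categories(page, categories):
    """Return (page, relation, target, confidence) tuples for known categories."""
    relations = []
    for category in categories:
        for relation, target, confidence in CATEGORY_SEMANTICS.get(category, []):
            relations.append((page, relation, target, confidence))
    return relations
```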

Pattern matching on page content

Pattern matching is another important way of extracting semantic structure from an article text. It is not as sophisticated as natural language analysis, but uses simple matches (regular expressions) to filter out some information. A good example is the time and place of birth and death of people.

Another example would be structures like townbox and taxobox: from those boilerplate elements, it is easy to extract properties and relations to other topics with high confidence. Also, the mere presence of a townbox implies that the page in fact describes a town, resulting in an is-a relation to town with a high confidence level.
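Extracting the parameters of a boilerplate template with a regular expression could look like this. It is a rough sketch: the parameter names in the example are hypothetical, and nested templates or piped links inside parameter values are not handled.

```python
import re


def template_params(wikitext, template_name):
    """Return the named parameters of the first matching template, or None."""
    pattern = re.compile(
        r"\{\{\s*" + re.escape(template_name) + r"\s*\|(.*?)\}\}",
        re.DOTALL | re.IGNORECASE,
    )
    match = pattern.search(wikitext)
    if match is None:
        return None
    params = {}
    for part in match.group(1).split("|"):
        key, sep, value = part.partition("=")
        if sep:  # ignore positional (unnamed) parameters
            params[key.strip()] = value.strip()
    return params


# Hypothetical page text; the parameter names are invented for the example.
page = "{{Townbox\n| name = Stade\n| population = 45000\n}}\nStade is a town."
box = template_params(page, "townbox")
```

A result other than `None` would by itself justify an is-a relation to town, and each parameter becomes a candidate property of the topic.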

Collocation analysis on the link structure

We can also apply a simple collocation analysis to the link structure: for example, two pages that link to the same page can be considered neighbours. If the two pages share several link targets, the neighbour relation gets more weight. Based on this we can calculate the similarity of two pages by analysing how similar their weighted neighbour sets are.
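One possible reading of this scheme is sketched below. The neighbour weight is taken to be the number of shared link targets, and similarity is measured as a weighted Jaccard overlap; the choice of that measure is an assumption, and other measures (such as cosine similarity) would fit the description equally well.

```python
from collections import defaultdict
from itertools import combinations


def neighbour_weights(links):
    """links: {page: set of link targets}. Weight = number of shared targets."""
    linked_from = defaultdict(set)
    for page, targets in links.items():
        for target in targets:
            linked_from[target].add(page)
    weights = defaultdict(dict)
    for pages in linked_from.values():
        for a, b in combinations(sorted(pages), 2):
            weights[a][b] = weights[a].get(b, 0) + 1
            weights[b][a] = weights[b].get(a, 0) + 1
    return weights


def similarity(weights, a, b):
    """Weighted Jaccard overlap of the neighbour sets of pages a and b."""
    na, nb = weights.get(a, {}), weights.get(b, {})
    # The two pages themselves are left out of the comparison.
    keys = (set(na) | set(nb)) - {a, b}
    total = sum(max(na.get(k, 0), nb.get(k, 0)) for k in keys)
    if total == 0:
        return 0.0
    return sum(min(na.get(k, 0), nb.get(k, 0)) for k in keys) / total
```

The resulting similarity score lies between 0 and 1 and can be compared against a threshold.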

This may be used to build clusters of pages, representing topic areas or categories. With these we could, for instance, aid categorization: if most pages in a cluster are in a specific category, it is likely that the others should be in that category (or a subcategory) too.
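Suggesting categories from a cluster could be sketched as a simple majority vote. Reading „most pages“ as „more than half“ is an assumption expressed by the `threshold` parameter.

```python
from collections import Counter


def suggest_categories(cluster, page_categories, threshold=0.5):
    """Return {category: [pages in the cluster that lack it]}."""
    counts = Counter()
    for page in cluster:
        counts.update(page_categories.get(page, set()))
    suggestions = {}
    for category, count in counts.items():
        if count / len(cluster) > threshold:
            missing = [p for p in cluster if category not in page_categories.get(p, set())]
            if missing:
                suggestions[category] = missing
    return suggestions
```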

Reasoning

From the information gathered, it is possible to infer new relations. Specifically, some relations are transitive, and some relations imply others. This way we can gain some knowledge about the relations of things in the Wikipedias that was never entered there explicitly. Because some relations are uncertain, however, we must be careful to propagate the confidence level when inferring new relations from old ones.
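Inference over transitive relations with propagated confidence might be sketched as follows. Multiplying the two confidence levels treats them as independent probabilities, which is an assumption; the set of transitive relation types and the cut-off value are also illustrative.

```python
# Assumption: which relation types are treated as transitive.
TRANSITIVE = {"is-sub-topic", "part-of"}


def infer_transitive(relations, min_confidence=0.1):
    """relations: {(source, type, target): confidence}. Returns the closure."""
    known = dict(relations)
    changed = True
    while changed:
        changed = False
        for (a, t1, b), c1 in list(known.items()):
            if t1 not in TRANSITIVE:
                continue
            for (b2, t2, c), c2 in list(known.items()):
                if t2 != t1 or b2 != b or c == a:
                    continue
                # The inferred relation is never more certain than its premises.
                confidence = c1 * c2
                if confidence >= min_confidence and confidence > known.get((a, t1, c), 0.0):
                    known[(a, t1, c)] = confidence
                    changed = True
    return known
```

The cut-off `min_confidence` keeps long chains of weak relations from flooding the database with near-worthless conclusions.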

Uses

The analysis proposed here would result in a large set of „concepts“ and relations between them. Concepts can also have attributes (like date of birth/death for people, etc). The relations extracted can be categorized into lexical relations (translation, synonym, homonym, etc), basic semantic relations (is-a, part-of, etc) and maybe some high-level semantic relations (is-place-of-birth-of, is-author-of, etc).

The lexical relations may be useful for dictionaries like Wiktionary. It may even be possible to automatically generate (or suggest) entries. The semantic relations could be added to existing ontology systems and semantic dictionaries (like WordNet, OpenCyc and Wortschatz) that specialize in handling such relations. Ontology systems are often used for AI-related „soft“ applications, like automatic text analysis and topic recognition, machine translation or expert systems.

For the Wikimedia projects, this could help to build tools that structure content (semi-)automatically, for instance by suggesting categories based on text content. The translations discovered could be used to implement a multilingual search or automated translation of category names. It has already been suggested that the latter be managed via the Ultimate Wiktionary, which would be a good place to store those relations.

Outlook

Templates are good

From the perspective of analysis by a program, templates are a very good thing: they can easily be recognized in a text, and often represent an important attribute of the article's topic. For example, if an article contains a townbox template, it can be assumed to be about a town. Even better, template parameters can often be interpreted directly as properties of the concept in question; a good example of this is the „Personendaten“ template („person data“) in the German Wikipedia.

Categories are weak

Categories are currently used to express different relations at once, which often leads to confusion, both for people and for automated analysis. A typical question would be whether the category „school book“ should contain books only, or also authors of school books. Relations often expressed by categories are is-a, part-of, component-of and subclass-of. However, other relations are common too, like the categorization of musicians by genre.

Categories also pose the problem that they force us to have two separate entities for a single concept: an „article“ and a „category page“. This leads to the notion of a „main“ article of a category, which is really unnecessary: it would be much clearer if the „member“ articles were simply „assigned“ to the „main“ article. That way, the „category“ would be a purely logical concept.

Ideally, articles could be „assigned“ in several ways, i.e. there would be several different relations possible between articles, like the ones mentioned above: is-a, part-of, and so on.

Support for semantic relations

Support for semantic relations as suggested above would solve many problems the category system currently poses to users. It would also add another valuable source of information for automated analysis. When aided by automated suggestions, this would help to structure the Wikipedias. Relations could be defined using a syntax similar to that for categories: [[is-a:city]], [[is-in:Germany]], etc. The set of possible relations should be configurable, like namespaces.
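Reading such typed links out of a page could be sketched as below. The proposed syntax does not exist in MediaWiki as described here, so the regular expression and the configured relation set are assumptions that merely follow the examples above.

```python
import re

# The configurable set of relation names (assumed, following the examples above).
RELATIONS = {"is-a", "is-in"}
TYPED_LINK_RE = re.compile(r"\[\[([a-z-]+):([^\]|]+)\]\]")


def typed_relations(page, wikitext):
    """Return (page, relation, target) for every configured typed link."""
    return [
        (page, relation, target.strip())
        for relation, target in TYPED_LINK_RE.findall(wikitext)
        if relation in RELATIONS
    ]
```

Prefixes that are not in the configured set, such as interlanguage codes, are simply ignored.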

As a long-term perspective, it would be possible to define new relations just like pages. It would then also become possible to define relations between relations, for example that it is impossible to have [[is-a:book]] and [[is-a:person]] in the same article, or that [[is-in:happy]] makes no sense. This would amount to building an ontology, wiki-style.

Machine-Readable Wiki: RDF & co.

RDF is a powerful standard for expressing relations between objects. On top of a very simple relational model, it defines properties, collections, class hierarchies, etc. It is often used for metadata like license, authors, source and date, which would be a good idea for Wikipedia. Furthermore, it can also be used to express category listings, backlinks and the like.

RDF would be useful for automated analyses in two ways: as a method of representing the result and, more importantly, as an easy-to-parse data source. This is especially important when one wants to analyse only individual articles or categories, without the bloat of the full database. RDF as an output format for all kinds of generated lists would also benefit bot development; it would, for instance, be a good way to list all contributors to an article, which is currently quite difficult.
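As an illustration of RDF as an output format, extracted relations could be written as N-Triples, a line-based RDF serialization. The two `example.org` namespaces are placeholders, and confidence levels are dropped in this sketch because a plain triple has no slot for them.

```python
from urllib.parse import quote

# Placeholder namespaces; a real deployment would define its own URIs.
BASE = "http://example.org/wiki/"
REL = "http://example.org/relation/"


def to_ntriples(relations):
    """relations: iterable of (source, relation, target) tuples."""
    lines = []
    for source, relation, target in relations:
        lines.append(
            f"<{BASE}{quote(source)}> <{REL}{quote(relation)}> <{BASE}{quote(target)}> ."
        )
    return "\n".join(lines)
```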

Summary

It can be concluded that it is relatively simple to extract information about the semantic relations of topics from the data present in the Wikipedias, even without actually analysing natural language. Also, a lot of lexical (dictionary) information like synonyms, translations and inflections can be extracted.

However, the semantic structure is limited to relatively few types of relations. To gain a more fine-grained view, existing ontologies could be used. Generally, the set of data produced by the analysis is already quite interesting; its best use, however, would be in combination with existing semantic databases like Cyc, WordNet or Wortschatz.

The data gained from the analysis can also be used to aid the structuring of existing Wikipedias, and to support a better multilingual interface.

Examples

Clusters (nds:wikipedia)

Some results of a simple cluster analysis of the Low German (nds) Wikipedia, which has ca. 2000 articles and 17000 (blue) links. The results work especially well for geographic articles, as apparently those are well developed in the nds Wikipedia. Also, there is one very big cluster, which indicates that the algorithm and the threshold values it uses should be tweaked some more.

Anguilla | Aruba
Grootbritannien_un_Noordirland | Powys | Grootbritannien
Kööm | Beer
1894 | 1945
Mars | Sünn | Maand
Protozoa | Protisten
Zabrze | Chorzów
Westerwolds | Twentsch | Noord-Veluws
Argentinien | Brasilien | Bolivien | Venezuela | Ecuador | Paraguay | Lima
Chile | Peru | Surinam | Kolumbien | Uruguay | Guyana
Blinker | Kieker_(Reekner)
Tallinn | Viljandi
Ravenna | Theoderich_de_Grote | Theoderich
Kuba | Haiti | Jamaika | Antigua_un_Barbuda | Grenada | Dominica | Dominikaansche_Republiek | St._Lucia | St._Kitts_un_Nevis | Karibik
Bahamas | Barbados | Trinidad_un_Tobago | St._Vincent_un_de_Grenadinen
Afghanistan | Irak | New_York | Bagdad | Terrorismus | Jordanien | Kuwait | Katar | Jemen | Kirgisien | Thailand | Saudi-Arabien
Libanon | Bhutan | Syrien | Süüdkorea
Malaysia | Indonesien | Myanmar | Tadschikistan
Kambodscha | Laos | Vietnam
Afrika | Antarktis | Algerien | Ägypten | Libyen | Tunesien | Marokko | Angola | Botswana | Sambia | Somalia | Tansania | Mosambik | Sudan | Malawi | Ghana | Gambia_(Land) | Äthiopien | Elfenbeenküst | Sierra_Leone | Süüdafrika_(Land) | Saint_Helena
773 | 772 | 776
Namibia | Liberia | Mauritius | Simbabwe | Kap_Verde | Guinea-Bissau | Swasiland
Atom | Anion | Anorganisch_Chemie | Blood | Base | Chemie | Chemische_Reaktschoon | Chemisch_Formel | Chemisch_Verbinnung | Chemische_Grundbegrepen | Elektrolyse | Energie | Elektron | Gaumookerphysik | Ion | Ioniseern | Iesen | Kation | Kohlensüür | Kohlenstoff | Molekül | Metall | Mineralogie | Noorddüütsche_Affinerie | NaCl | Natronlaug | Suerstoff | Solt | Swevel | Sülver | Süür | Theoter | Water | Waterchemie | Waterstoff | Tinn | Swefelsüür | Elektronik | Kiel_(Schipp) | Kopper | Kristallwater | Koppersulfaat | Proton
Maandag | Dingsdag | Dunnersdag
Freekark | Liste_vun_de_Freekarken_in_Düütschland
Lyrik | Literatur
Mali | Niger | Nigeria | Madagaskar | Guinea | Burkina_Faso | Tschad | Dschibuti | Zentraalafrikaansche_Republiek | Demokraatsche_Republiek_Kongo | Nairobi
Chemisch_Element | Chemisch_Stoff | Nichtmetalle | Periodensysteem | Sott
Kanada | Terror | USA | Belize | Mexiko | Honduras | Panama | Nicaragua | Guatemala | Costa_Rica | El_Salvador | Jazz | Blues | Charlie_Parker
Botanik | Eukaryota | Zoologie | Beest | Archaeen | Bakterien | Anatomie
Australien | Nauru | Kiribati | Niegseeland | Palau | Vanuatu | Tuvalu | Tonga | Samoa | Fidschi | Marshallinseln | Oosttimor | Papua-Niegguinea | Salomonen | Mikronesien | Ozeanien
Rumäänsch | Istrorumäänsch | Dakorumäänsch
Albanien | Andorra | Adam_vun_Bremen | Amerika | Asien | Anatolien | Ankara | Athen | Araabsche_Spraak | Atlantik | Belgien | Bosnien-Herzegowina | Bulgarien | Berlin | Barg | Baden-Württemberg | Bayern | Bewick | Billers | Baltikum | Preßburg | Christoph_Kolumbus | Düütschland | Däänmark | Düütsche_Spraak | Danzig | Däänsche_Spraak | Europa | Etymologie | Estland | Eerdeel | Eider | Eurasien | Finnland | Frankriek | Flüsse | Franzosentiet | Geographie | Grekenland | Gröönland | Grunneng | Groningen | Grunnengs | Gallien | Gaius_Julius_Caesar | Griepswohld | Hööftsiet | Hööftstadt | Hansetiet | Hessen | Hannober | Hanse | Hinnerk_De_Leuw | Ilv | Ingelsch | Italien | Irland | Island | Iesenbahn | Insel | Kark | Königriek | Klenner | Kroatien | Kiel_(Stadt) | Königsbarg | Klaipeda | Kirow_(Stadt) | Latiensche_Spraak | Lettland | Litauen | Luxemburg | Liechtensteen | London | Ljouwert | Lübeck | Labskaus | Monarkie | Middelöller | Mekelnborg-Vörpommern | Makedonien_(Land) | Malta | Moldawien | Monaco | Montenegro | Makedonien | Moskau | Middelamerika | Memel | Nedderlannen | Nokieksel | Neddersassen | Noordamerika | Norwegen | Noordsee | Nedderlandsche_Spraak | Oostfreesland | Ole_Tiet | Oostsee | Plattdüütsch | Polen | Plattdüütsch_Vokabular | Plattdüütsche_Orthographie | Portugal | Paris | Plautdietsch | Pommern | Religion | Religionen_vun_de_Welt | Russland | Römertiet | Rumänien | Rom | Röömsch_Riek | Rhienland-Palz | Rostock | Rügen | Riga | Swiez | Sassen | Sweden | See | Sleswig-Holsteen | San_Marino | Serbien_un_Montenegro | Slowakei | Slowenien | Spanien | Süüdamerika | Sibirien | Sankt_Petersborg | Sassen_(Bundsland) | Sassen-Anhalt | Städer_up_de_Eer | Stettin | Swerin | School | Soziologie | Skandinavien | Steentiet | Tschechien | Törkie | Transsibirisch_Isenbahn | Thüringen | Ukraine | Ungarn | Vatikaan | Vilnius | Wikipedia | Werser | Wittrussland | Warschau | Weströömsch_Riek | Zypern | Österriek | 1998 | 1492 | 395 | 596 | Spraken_vun_de_Welt | 
Bronzetiet | Upnohm_vun_niege_EU-Länner | EU | Republiek | Wetenschop | Stadt | Hamborger_Platt | Westfäälsch_Platt | Ollnborg | Ostnederdüütsch | Neddersassisch | Westfalen | Noordrhien-Westfalen | Niege_Hanse | Brannenborg_an_de_Havel | Kaliningrader_Oblast | Maschinenbu | Fritz_Reuter | Schriever | Football_EM_2004 | Bundsland | Kiel | Ingväonsche_Spraken | Fluss | England | Horst_Köhler | 2004 | Bunnspräsident_(Düütschland) | Johannes_Rau | Roman_Herzog | Theodor_Heuss | Heinrich_Lübke | Gustav_Heinemann | Walter_Scheel | Richard_von_Weizsäcker | Karl_Carstens | Bunnskanzler_(Düütschland) | Konrad_Adenauer | Ludwig_Erhard | Willy_Brandt | Helmut_Schmidt | Helmut_Kohl | Joschka_Fischer | Bunnsministerium_för't_Verdeffenderen | Gerhard_Schröder_(CDU) | Hans-Dietrich_Genscher | Klaus_Kinkel | Vizekanzler_(Düütschland) | Franz_Blücher | Jürgen_Möllemann | Erich_Mende | Hans-Christoph_Seebohm | Peter_Struck | Rudolf_Scharping | Volker_Rühe | Gerhard_Stoltenberg | Rupert_Scholz | Manfred_Wörner | Hans_Apel | Franz_Josef_Strauß | Kai-Uwe_von_Hassel | Theodor_Blank | 15_April | CDU | Bonn | Westplatt | Franzöösche_Spraak | Rhien | Oste | Donau | Haven | Wikinger | Japan | Balje_(Neddersassen) | Nordkehdingen | Südkehdingen | Amtsspraak | Stralsund | Middelmeer | Kosovo | Soest | Armenien | Aserbaidschan | Georgien | München | Bangladesch | Indien | Malediven | Mongolei | Brunei | Kasachstan | Nepal | Sri_Lanka | Singapur | Republiek_China | Noordkorea | Usbekistan | Philippinen | Turkmenistan | Volksrepubliek_China | Palästinensische_sülvstregeerte_Rebeden | Pazifische_Ozeaan | 1969 | 1963 | SPD | Provinz_Grunneng | Bozen | Düsseldörp | 27._April | Göttingen | Stavanger | Noordneddersassisch | Okzitansch | Breslau | Baukem | Meideborch | Kraków | Elbing | Frauenburg | Bergkamen | Sławno | Darłowo | Kamen | Breckerfeld | Lünen | Werne | Kaunas | Unna | Dööp | Germaansche_Spraken | 2005 | Essen | Nikosia | Hebrääsche_Spraak | Fröndenberg | Schweierte | Mennoniten | 
Hattingen | Swelm | Gevelsberg | Jesus_Christus | Ventspils | Smolensk | Nedderdüütsch | Rheine | Naugard | Provence | Frohnhausen | Gereformeerde_Kerken | Kornelius_Wiebe | Neuapostolische_Kirche | Pskow | Belosersk | Ruihen | Havelberg | Werl | Tangermünde | Demmin | Stendal | Werben | Osterburg | Lippstadt | Dollar | Euro | Masowier | Balve | Neuenrade | Holsterhausen | August_Hinrichs | Theodorianum | Poitevin-saintongeais | Hoochdüütsch | Warburg | Płock | Serbokroatsch | Spraken_in_Frankriek | Dülmen | Arnsberg | Altena | Penthouse | Coesfeld | Werner_Heisenberg | Nikolaus_Kopernikus | Peckelsheim | Städer_in_Polen | Bydgoszcz | Kielce | Gdynia | Oppeln | Blankenstein_(Hattingen) | Kattowitz | Astuursch | Bielsko-Biała | Sosnowiec | Dąbrowa_Górnicza | Langues_d'oïl | Picardsch | Lorrain | Franc-comtois | Walloonsch | Bourguignon | Champenois | Gaiseke | Nedderfranksch | Mandarin | Spaansche_Spraak | Limburgisch-Bergisch | Lüdenscheid | Buddhismus | Hindi-Urdu | Bredeney | Menden | Neustadt | Litausche_Spraak | Sloweensche_Spraak | Patterbuorn | Marie_Curie | Aristoteles | Platon | Kaschubsch | Malaische_un_indonessche_Spraak | Kommunismus | Iserlaun | Kollenhaordt | Beälke | Peter_de_Grote | Südgeldersch | Tony_Blackplait | Kujawier | Pund | Hollandsche_Dialekt | Vincent_van_Gogh | Paul_Cézanne | Tweet_Weltorlog | Hinduismus | Usâmah_bin_Lâdin | Frédéric_Chopin | Mönster | Akbar | Kolonisation_vun_Süüdamerika | Kolonisation_vun_Noordamerika | Koreaorlog | Quedlinburg | Johannes_Paul_II. | Tamerlan | Olwestfälsch | Pund_Sterling | Russ'sche_Börgerorlog | Kalter_Krieg | Sowjetische_Besetzung_Afghanistans | Grieth | Konfuzius | Konfuzianismus | Taoismus | Ingelsche_Börgerorlog | Siddhartha_Gautama | Opdecken_vun_Amerika | Hirschberg | Langscheid | Harriet_Tubman | Regionaalspraak | Königin_Elisabeth_I. 
| Emma_Goldman | Rosa_Luxemburg | Enger | Noordzypern | Halberstadt | Plettenberg | Arumuunsch | Meglenorumäänsch | List_vun_dat_Weltarv | List_vun_de_Grootstäder_in_Düütschland | Blohm_&_Voss | Pisa | Johann_Sebastian_Bach | Thulla | Nanak | Spraakwetenschop | Ido
Week | Weekdag | Middeweken | Freedag | Sünndag
Benin | Burundi | Kamerun | Ruanda | Mauretanien | Kenia | Togo | Uganda | Lesotho | Gambia | Senegal | Seychellen | Äquatoriaal-Guinea | Komoren | Eritrea | Gabun | São_Tomé_un_Príncipe | Republiek_Kongo | Mayotte | Gangnihessou
Aant | Duun | Deerriek | Fedder | Gans | Vagel | Söögdeer | Puter | Systematik_(Biologie) | Oort_(Biologie) | Amphibia | Cnidaria | Mesomycetozoa | Ciliophora
Othmarschen | Bahrenfeld | Mottenburg
Irakorlog | Situatschoon_in'n_Irak
Bülgenläng | Farv
Balje | Land_Kehdingen
20_Juni | 14_November
Sünnavend | Saterdag
Euglenozoa | Apicomplexa
Astronomie | Mathematik
Reekner | Linux_op_Platt | K_Desktop_Environment | Böverflach | Muus_(Reekner) | Elektrotechnik | Nettkieker | Kieker | Software | Opera | Bedriefssysteem | Linux | Firmware | Unix | Microsoft_Windows | Hardware | Reeknernettwark | Nettwark
Israel | Nahoost | Bahrain | Iran | Oman | Pakistan | Vereenigte_Araabsche_Emiraten | Semitsche_Spraken | Zarathustra
Bremen | Hamborg | Holsteen | Rechtenfleth | Langeoog | Flottbek | Altona | Iserbrook | Lurup | Ottensen | Blankenese | Sülldorf | Rissen | Bezirk | Nienstedten | Wachholtz_Verlag
Düörpm | Dorsten | Herne
Niederpreußisch | Mark-Brannenborger_Platt
Experiment | Natuurwetenschop | Afk
Hohn | Bueree
Nischni_Nowgorod | Wladiwostok
Tarnów | Legnica
Billerbeck | Borghorst
Oer-Erkenswick | Marienmünster
875 | 1076
Reformatschoon | Nieg_Testament
Fürstenau | Quakenbrück
Merseburg | Naumburg_(Saale)
Łódź | Wałbrzych
John_Major | Premierminister_vun_Grootbritannien
12_Dezember | 1903 | 1949
Book | Juristeree | Kultur
Chäsekerken | Castrop-Rauxel | Riäkelkusen
Rindveeh | Schaap | Zeeg
Brackwater | Meer | Soltwater | Fluss_(Water) | Stroom_(Water)
Eresborg | Karl_de_Grote
Sövenden-Dags-Adventisten | Baptisten
Westgermaansche_Spraken | Südfränkisch
Melk | Snacks | Veehtüch | Landwertschaplich_Bedrief
Natschonaalversammeln_vun_Wales | Wales | Cardiff
Minsch | Reptilia | Muus
Greunkohl | Plant | Kruut | Boom | Photosynthese | Chloroplast
Augustin_Wibbelt | Eli_Marcus
Christelijke_Gereformeerde_Kerken | Gereformeerde_Kerke
Nieheim | Bödefeld
Planet | Dag | Eer | Präzession
Springtid’ | Tiden
Seehausen | Soltwedel
Biologie | Medizin | Physik | Tardigrada | Poggenstöhl
Gravitatschoon | Johr
Tietrebeet | Internatschonale_Telefoonvörwahl
Visby | Kokenhusen
Dinslaken | Voerde
Ammerland | Bad_Twüschenahn | Ammerlänner_Buurnhuus
Galina_Starowojtowa | Wladimir_Putin
Stedinger | Karl_Rudolf_Brommy
Wizebsk | Polasier
Eeten | Foot | Eten
Auerk | Freesland
Köslin | Belgard
Drinken | Buddel
1_Juli | 22_Februar | 7_Juli
Holt | Köök
Vreden | Emmerek
Jadebusen | Asegabook
Stargard | Attendorn
Kuldiga | Cesis
Katt | Hund
Sokrates | Ole_Grekenland
Heinrich_von_Brentano | 5_Januar
Kruutsand | Stood
