Wikimedia Conference 2010/Developers' Workshop/Notes/MetaDataSearch
Problem: search on Commons is source-text only, which is inadequate for searching images.
Things that need to be included: licences, meta-data, categories.
Goal for today's discussion: can we cover all three things with one solution, given that four different groups (SMW people, GSoC student, WMDE contractor & multimedia usability team) are each working separately on these problems?
current concurrent projects:
- WMDE: hired a contractor to rewrite catscan / category intersection
  - recursive category evaluation/intersection remains a standalone component. It aids in evaluating meta-data but does not provide any.
- GSoC project: IPTC/XMP metadata extraction (mentor: ^demon)
- multimedia usability project (guillom / neilk)
  - searching for images by meta-data properties
  - suggesting categories based on meta-data provided on upload
- SMW folks
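The recursive category evaluation/intersection mentioned for the catscan rewrite can be sketched roughly as follows. This is only an illustration in Python, not the contractor's implementation: the function names, the in-memory dictionaries and the sample category tree and file names are all made up for the example.

```python
from collections import deque

def expand(category, subcats):
    """Return the category plus all of its descendants (breadth-first, cycle-safe)."""
    seen = {category}
    queue = deque([category])
    while queue:
        current = queue.popleft()
        for child in subcats.get(current, ()):
            if child not in seen:  # category graphs can contain loops
                seen.add(child)
                queue.append(child)
    return seen

def files_in_tree(category, subcats, members):
    """Collect every file that sits in the category or any of its descendants."""
    files = set()
    for cat in expand(category, subcats):
        files |= members.get(cat, set())
    return files

def intersect(cat_a, cat_b, subcats, members):
    """Files found under both category trees."""
    return files_in_tree(cat_a, subcats, members) & files_in_tree(cat_b, subcats, members)

# Hypothetical sample data: parent category -> direct subcategories
subcats = {
    "Bonn": ["Buildings in Bonn"],
    "Buildings in Bonn": ["Landesvertretung Bayern Bonn"],
}
# Hypothetical sample data: category -> files placed directly in it
members = {
    "Landesvertretung Bayern Bonn": {"File:A.jpg", "File:B.jpg"},
    "Photographs by Jens Gathmann": {"File:A.jpg", "File:C.jpg"},
}

result = intersect("Bonn", "Photographs by Jens Gathmann", subcats, members)
print(sorted(result))  # ['File:A.jpg']
```

Note that this only evaluates category membership; as said above, it does not provide any meta-data itself.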
EXIF info in the DB is currently stored as a serialized PHP array, which doesn't facilitate search.
Goal: do better for IPTC/XMP in order to make it searchable.
Two ways of exposing data to Lucene search backend:
- make a special table in database with key/value pairs, query it
- expose data about the article in XML dump (export-0.4.xsd)
current schema for xml dumps: docs/export-0.4.xsd
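To make the first option (a key/value table) more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `file_metadata`, its columns, the helper functions and the sample values are assumptions for illustration only; nothing here reflects an actual MediaWiki schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per (page, metadata key, value)
conn.execute("""
    CREATE TABLE file_metadata (
        page_id  INTEGER NOT NULL,
        md_key   TEXT NOT NULL,
        md_value TEXT NOT NULL
    )
""")
# Index so lookups by key/value do not scan the whole table
conn.execute("CREATE INDEX md_key_value ON file_metadata (md_key, md_value)")

def store_metadata(conn, page_id, metadata):
    """Store each extracted property as its own row instead of one serialized blob."""
    conn.executemany(
        "INSERT INTO file_metadata (page_id, md_key, md_value) VALUES (?, ?, ?)",
        [(page_id, key, str(value)) for key, value in metadata.items()],
    )

def find_pages(conn, key, value):
    """Return the ids of pages whose metadata has the given key/value pair."""
    rows = conn.execute(
        "SELECT DISTINCT page_id FROM file_metadata"
        " WHERE md_key = ? AND md_value = ? ORDER BY page_id",
        (key, value),
    )
    return [row[0] for row in rows]

# Made-up sample values
store_metadata(conn, 1, {"Artist": "Example Photographer", "Copyright": "Example Archive"})
store_metadata(conn, 2, {"Artist": "Someone Else"})

print(find_pages(conn, "Artist", "Example Photographer"))  # [1]
```

With a serialized array the same lookup would require unserializing every row, which is why it does not help search.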
Conclusions
- exposing via xml dump can be quite complicated, how to structure the metadata?
- we will expose image metadata via xml dumps (Chad's student). Step 1 is extracting the data and storing it in some sane format. Serialized arrays aren't sane.
- semantic stuff needs more pondering
- CatScan remains a stand-alone TCP server written in C
Examples
Current format (real example)
<page>
<title>
File:Bundesarchiv B 145 Bild-F024214-0006, Bonn, Landesvertretung Bayern, Kommunalpolitiker.jpg
</title>
<id>5453390</id>
<revision>
<id>37106144</id>
<timestamp>2010-04-01T00:00:04Z</timestamp>
<contributor>
<username>BotMultichill</username>
<id>211386</id>
</contributor>
<minor/>
<comment>
Adding author from {{BArch-description}} to {{BArch-License}}
</comment>
<text xml:space="preserve">
== {{int:filedesc}} ==
{{Information
|Description={{BArch-description
|comment= <!-- add translations and/or more description -->
|biased=<!-- if the original description text is biased, write here why! -->
|headline=Bonn, Landesvertretung Bayern, Kommunalpolitiker
|caption=Bayerische Kommunalpolitiker mit Minister Höcherl in der Landesvertretung Bayern und Jugendgruppe Volkshochschule Hesselberg
|extra=
|people=
}}
|Source=Deutsches Bundesarchiv (German Federal Archive), {{BArch-link|B 145 Bild-F024214-0006}}
|Author=Gathmann, Jens
|Date=1967-03-15
|Permission=[[Commons:Bundesarchiv]]
|other_versions=
}}
=={{int:license}}==
{{BArch-License
|signature=B 145 Bild-F024214-0006
|batch=B 145
|author=Gathmann, Jens
|year=1967
|month=<!-- 03 (omitted to avoid overly detailed category structure) -->
|location=Bonn <!-- Please leave as is, add appropriate categories directly. Exception: if needed, change "location=" to "topic=". -->
|topic= <!-- Please leave as is, add appropriate categories directly. Exception: if needed, change "topic=" to "location=". -->
|PD=<!-- set this if you are sure the image is PD -->
}}
[[Category:Landesvertretung Bayern Bonn]]
[[Category:Photographs by Jens Gathmann]]
</text>
</revision>
<upload>
<timestamp>2008-12-10T23:32:51Z</timestamp>
<contributor>
<username>BArchBot</username>
<id>465132</id>
</contributor>
<comment>
== {{int:filedesc}} ==
{{Information
|Description={{BArch-description
|comment= <!-- add translations and/or more description -->
|biased=<!-- if the original description text is biased, write here why! -->
|headline=Bonn, Landesvertretung Bayern, Kommuna
</comment>
<filename>
Bundesarchiv_B_145_Bild-F024214-0006,_Bonn,_Landesvertretung_Bayern,_Kommunalpolitiker.jpg
</filename>
<src>
http://upload.wikimedia.org/wikipedia/commons/9/98/Bundesarchiv_B_145_Bild-F024214-0006%2C_Bonn%2C_Landesvertretung_Bayern%2C_Kommunalpolitiker.jpg
</src>
<size>47316</size>
</upload>
</page>
Daniel's suggestion for XML format of metadata:
<page...>
<revision>...
<data ref:about="revision-uri">
<rdf:item rdf:property="some uri">...value...</rdf:item>
</data>
<data ref:about="page-subject-uri">
<rdf:item rdf:property="some uri">...value...</rdf:item>
</data>
<data ref:about="upload-uri">
<rdf:item rdf:property="some uri">...value...</rdf:item>
</data>
</revision>
<data ref:about="page-uri">
<rdf:item rdf:property="some uri">...value...</rdf:item>
</data>
<upload>
<data ref:about="upload-uri">
<rdf:item rdf:property="some uri">...value...</rdf:item>
</data>
</upload>
</page>
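As a rough idea of how an exporter could emit the `<data>` blocks from Daniel's suggestion, here is a small Python sketch using xml.etree.ElementTree. The helper name `build_data`, the property URI and the value are placeholders. The `ref:` and `rdf:` prefixes are written literally, since these notes do not say which namespace URIs they would be bound to.

```python
import xml.etree.ElementTree as ET

def build_data(about, properties):
    """Build one <data> element holding an <rdf:item> per (property URI, value) pair."""
    # Prefixed names are passed through as plain strings (no namespace binding here)
    data = ET.Element("data", {"ref:about": about})
    for prop_uri, value in properties:
        item = ET.SubElement(data, "rdf:item", {"rdf:property": prop_uri})
        item.text = value
    return data

# Placeholder subject, property URI and value
upload_data = build_data(
    "upload-uri",
    [("http://example.org/prop/artist", "Example Photographer")],
)
xml_text = ET.tostring(upload_data, encoding="unicode")
print(xml_text)
```

In a real dump the prefixes would have to be declared on an enclosing element (and in the export schema) before an XML parser would accept the output as namespace-well-formed.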