Semantic MediaWiki/Open issues

The ideas and visions articulated in this project are wide-reaching, and many objections can be made. Some of the apparent problems have been addressed in earlier Semantic Web research, and satisfactory solutions are known or underway. Others might be limitations of the current approach, and it is important to coordinate our current efforts so that future extensions can overcome these limits. Finally, the encyclopaedic wiki environment we are dealing with is an additional challenge that may create problems of its own.

Unclear semantic modelling

Will users actually understand the correct usage of the new annotations, or will annotations be used wildly and without a consistent meaning?

Unclear semantics are very likely to sneak in; we already know such problems from the category system. However, the category system also demonstrates that practical use can tolerate such sloppiness very well. We will probably not be able to create a high-quality "ontology of the world", but rather a machine-accessible Wikipedia with all the inconsistencies and problems that the current human-only content has.

In any case, this problem calls for careful measures during the introduction: educate coordinators of Wikiprojects like Wikispecies about proper usage, so that there are good examples from early on (even before most people use annotations). Provide documentation and guidance in parallel with the introduction. Most people learn new features by imitating others and reading the talk pages, not by reading the documentation, so we should prepare convincing arguments for what we consider proper use of the typed links.

Expressiveness of annotations is too weak

Fact X within Wikipedia cannot be expressed in the current proposal. What now?

We do not want to make everything in Wikipedia accessible to machines. There must be a trade-off between usability, performance, and expressivity. Even if we can only use annotations in the limited areas where they have been called for so far, it would still be a huge gain for most users. For more expressive solutions, we have to monitor current developments in the field. Using a standard format will hopefully allow us to adopt future extensions of the standard without too much effort (if desired!).

Modelling of time

Wikipedia deals with great amounts of historical data. But the annotations we give (like "Berlin is capital of Germany") do not have temporal restrictions.

This will remain an open issue: we would need to annotate typed links or data values with time spans to fully solve the problem. However, this is not really supported by today's ontology languages, and thus we could not even use a standard exchange format in this case. Furthermore, entering such time spans would require much more complicated wiki source.

To begin with, it might be helpful to include (very coarse) temporal information in link types and data fields. For example, one can have a relation "was capital of" or a data field "former capital" in addition to the current ones. This allows some useful queries ("What were the capitals of Germany and what is the current one?"), and we do not lose functionality by implementing these things. More advanced support for time is a very difficult problem (often, one does not even have a clear time span during which some fact is true) and is reserved for future extensions and for research.
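The coarse approach can be illustrated with a small Python sketch. It assumes, purely for illustration, that annotations are available as (subject, relation, object) triples and that the current capital is expressed by a relation named "is capital of"; the function name capitals_of and the storage layout are hypothetical and not part of the proposal.

```python
# Annotations as (subject, relation, object) triples.
# The list layout and the relation names are assumptions for this sketch.
TRIPLES = [
    ("Berlin", "is capital of", "Germany"),
    ("Bonn", "was capital of", "Germany"),
    ("Paris", "is capital of", "France"),
]


def capitals_of(country, triples=TRIPLES):
    """Return (current, former) capitals using the two coarse relations."""
    current = [s for s, r, o in triples if r == "is capital of" and o == country]
    former = [s for s, r, o in triples if r == "was capital of" and o == country]
    return current, former


if __name__ == "__main__":
    current, former = capitals_of("Germany")
    print("current:", current)
    print("former:", former)
```

Note that the sketch cannot say when Bonn was the capital; that is exactly the information the coarse relations give up.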

Dealing with irregularities

Many annotations that are meaningful for most articles in some area cannot be given for some exceptional cases, or require more than the single foreseen annotation in these cases. Take, for example, the three sizes of France.

True, and we will not be able to plan all of these exceptions beforehand. Now if every country needs a single number as its size, we run into problems with France. But giving three values for the size would be useless for most other countries. It is also unclear what a list of countries sorted by size should look like in general: do we want to exclude France completely, do we give some "typical" size of the parts that one usually means when speaking of this country, do we create single articles for the different "Frances" (in fact, there might be such articles already)?

These things will arise often, and one has to be able to handle them somehow. A general guideline would be to think practically: if someone asks for a list of countries by size, what answer would be best? (In this case, France should probably pop up three times, with labels that indicate what each entry refers to.)

Another example: Population numbers generally are integers and they are understood as being rather imprecise. However, in some problematic cases, one only has a range for the value, with some minimal and maximal estimates. What should now happen if someone wants a list of the hundred largest cities? Should such cities be excluded because we cannot give a precise value? In this case, the community should probably agree on some realistic value and give this in the data-field. After all, the annotations are only an approximation of the Wiki-content; they can never capture the whole complexity of affairs. But they can still be useful.

In such cases, it would be nice to be able to give a comment on the value for the data field "population", stating that it is only an estimate or that it is outdated. The proposed exchange formats like OWL/RDF support free-text comments on statements, so this would be a feature to consider.

Units of measurement

Giving data values usually is not enough, since many annotations specify a physical quantity that must be related to a unit of measurement. How can this be taken care of?

Normally, one would be tempted to associate the intended unit implicitly with the property that a value belongs to. For example, a property like the declination of a planet is most likely measured in degrees. However, in many other contexts, several different units are possible, sometimes due to considerations of scale (e.g. days vs. years) and sometimes due to different systems of measurement (e.g. Celsius vs. Fahrenheit). There are several problems in such cases:

The exchange format will usually only store a single numerical value. The intended unit must be communicated to the users of the annotation export.
The editors will want to enter some data value in a particular unit and not care about calculating the proper value in some other unit first.
When returning search results or other data that was gathered automatically on Wikipedia, values for certain properties will have to be printed. This requires including the unit and sometimes giving the value in different units (e.g. "km" and "miles").

The first problem is not so hard: a data property can include the intended unit in its definition. Any user of the annotation export has to know about the definition anyway, in order to get the name of the property etc. This unit information can be implicit (i.e. given only in human-readable text), since programmers can build it into their code for using the values. The second problem is far more tricky. There are two principal solutions:

Associate each property with a unit and create new properties for new units (e.g. "length (cm)" and "length (inch)"). Then the user can choose which property is in accordance with the numerical value provided. Problems occur when relating properties that are present in many unit variants: to sort the articles by "length", we must compare the values of "length (cm)" and "length (inch)"; in some cases, users may have given both within the article: does this mean there are two possible lengths, or is one just the converted value? (Usually, rounded values will not agree with the exact conversions, so we cannot detect this.) Similar problems arise for external programs working on the data, and we may not be able to use standard tools directly. All in all, this solution seems to lead to huge hacks in any software that works with our annotations.
Continue to store the value of some property only in one predefined unit, but build conversion features into the parser to allow the user to specify the value in different units. This solution is nicer from an architectural viewpoint: unit conversion is handled at input time, and the exported annotations are clean and unambiguous. Unfortunately, it requires us to write support for all kinds of units. This problem can be alleviated by starting only with very common units of daily use, since users in scientific areas often have more standardized units (chiefly the metric system) and might be less reluctant to convert them manually. To enable automatic conversion of inputs like "29in" or "42cm" into a target unit like "m", the system has to know about these units. It would be feasible to build this into the data type. For example, a data type "length (m)" given to properties could specify that (a) the numerical value is stored in the XML datatype "decimal", (b) the numerical value in XML refers to "m" as the default unit, and (c) the parser has to expect all kinds of units of length in the wiki source of this property. Again, it is clear that such datatypes generate additional implementation effort, but they create a cleaner architecture and cleaner annotations.
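A minimal Python sketch of the second solution, for a data type along the lines of "length (m)", might look as follows. The names LENGTH_UNITS_IN_M and parse_length, the accepted input syntax, and the small selection of units are assumptions made for this sketch; treating a bare number as a value in the default unit is likewise only one possible design.

```python
import re
from decimal import Decimal

# Size of each known unit in metres (the default unit of this data type).
# Only a handful of common units are listed; a real data type would need more.
LENGTH_UNITS_IN_M = {
    "m": Decimal("1"),
    "cm": Decimal("0.01"),
    "mm": Decimal("0.001"),
    "km": Decimal("1000"),
    "in": Decimal("0.0254"),
    "ft": Decimal("0.3048"),
}

# A number, optionally followed by a unit symbol, e.g. "29in" or "42 cm".
_VALUE_RE = re.compile(r"^\s*([+-]?\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*$")


def parse_length(text):
    """Parse user input and return the length as a Decimal in metres."""
    match = _VALUE_RE.match(text)
    if match is None:
        raise ValueError(f"not a recognisable length: {text!r}")
    number, unit = match.groups()
    unit = unit or "m"  # assumption: a bare number is already in metres
    if unit not in LENGTH_UNITS_IN_M:
        raise ValueError(f"unknown unit of length: {unit!r}")
    return Decimal(number) * LENGTH_UNITS_IN_M[unit]


if __name__ == "__main__":
    for raw in ("29in", "42cm", "1.5"):
        print(raw, "->", parse_length(raw), "m")
```

Decimal is used instead of float so that the stored value corresponds directly to the XML datatype "decimal" and does not pick up binary rounding noise during conversion.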

Google is pretty good at unit conversion and parsing. Maybe we could just use that to get some unified representation?

Luckily, our current solution allows rather arbitrary usage of units, even if they are unknown to the MediaWiki software. Using Google might be helpful but possibly raises some legal issues (I guess using Google's services internally would be allowed, but it might require us to praise Google somewhere). It also creates some external dependencies: does Google have localized unit identifiers? For most languages? Will the syntax of the service remain unchanged? Finally, it reduces processing speed, since an additional HTTP request is issued for every data value before displaying an article.

The third point above can also be solved in this way: datatypes already imply the "main unit", but they can also define which additional units to display conversions for. Unfortunately, no easier solution seems to be feasible. Note that there will still be data values and types without any unit, and that these can always serve as a fallback if the community agrees on one implicit unit to use. By accepting values without any unit even for data that has an associated unit and conversion scheme, one can gradually introduce units for formerly unsupported types as well, without having to rewrite the wiki source where the annotations are made.
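How a datatype could drive the display of converted values is sketched below in Python. The dictionary layout of DISTANCE_KM and the function render are hypothetical; they only illustrate the idea that a datatype names one main unit plus the additional units to show conversions for.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical datatype definition: the stored value is in the main unit,
# and each display unit is given by its size expressed in the main unit.
DISTANCE_KM = {
    "main_unit": "km",
    "display_units": {"miles": Decimal("1.609344")},  # kilometres per mile
}


def render(value, datatype, places=1):
    """Format a stored value in its main unit, followed by conversions."""
    quantum = Decimal(10) ** -places  # e.g. Decimal("0.1") for one decimal place
    main = f"{value.quantize(quantum, rounding=ROUND_HALF_UP)} {datatype['main_unit']}"
    extras = []
    for unit, size_in_main in datatype["display_units"].items():
        converted = (value / size_in_main).quantize(quantum, rounding=ROUND_HALF_UP)
        extras.append(f"{converted} {unit}")
    return f"{main} ({', '.join(extras)})" if extras else main


if __name__ == "__main__":
    print(render(Decimal("100"), DISTANCE_KM))
```

A datatype without any unit would simply carry an empty set of display units, which keeps the fallback described above cheap to support.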

I regard it as a bigger problem that we don't have a syntax for datatypes yet...

This is now fixed. In the current implementation plan, only a syntax for assigning data types to attributes is still missing.

Things will get even more complex. Please read and comment on my articles on Ontoworld for further details: Units, Math
Thanks, MovGP0 22:01, 12 November 2005 (UTC)

Wiki markup clutter

For beginners, wiki markup is often already hard to understand. Adding a whole array of new tags will be confusing for many and will decrease readability. Obviously, this concern only applies to mainstream wikis like Wikipedia and not to specialist wikis that are semantic from the outset.

Information redundancy

The example "Berlin is the capital of [[Is capital of::Germany]]" makes it fairly obvious: semantic information is redundant with the existing human-readable information. So perhaps semantic information should be added non-invasively, on a separate layer on top of the current markup instead of within it. That way, a large portion of it could be autogenerated, and only the cases where this is not possible would have to be addressed manually.
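To make the idea of a separate layer concrete, the following Python sketch pulls typed links of the form shown in the example out of the wiki source and returns both the extracted statements and the text with ordinary links only. The function names, the regular expression, and the handling of a piped label ("[[relation::target|label]]") are assumptions made for this sketch, not a description of the actual parser.

```python
import re

# Typed links: [[relation::target]] or, as an assumption, [[relation::target|label]].
TYPED_LINK_RE = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_annotations(page_title, wiki_source):
    """Return (subject, relation, object) triples found in the wiki source."""
    return [
        (page_title, m.group(1).strip(), m.group(2).strip())
        for m in TYPED_LINK_RE.finditer(wiki_source)
    ]


def strip_annotations(wiki_source):
    """Return the wiki source with typed links reduced to ordinary links."""
    def plain(m):
        target, label = m.group(2).strip(), m.group(3)
        return f"[[{target}|{label}]]" if label else f"[[{target}]]"
    return TYPED_LINK_RE.sub(plain, wiki_source)


if __name__ == "__main__":
    source = "Berlin is the capital of [[Is capital of::Germany]]."
    print(extract_annotations("Berlin", source))
    print(strip_annotations(source))
```

With such a split, the triples could be kept in their own layer while editors see only the plain links; generating the triples automatically where possible remains the harder part.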

This article is associated with the project Semantic MediaWiki.