Abstract Wikipedia/Template Language for Wikifunctions/Transformation of template syntax to other formalisms

The template language is intended for use within Abstract Wikipedia, but this scope can be interpreted flexibly. For instance, you may already have a realiser for your language that is not easily transformed into Wikifunctions or Scribunto, while the output is still intended to be a Wikipedia article. Or you may wish to use the first stages of NLG (Wikidata + constructors + templates) to produce a multilingual quiz for your own mobile app rather than a Wikipedia article. This should be feasible. A key step is transforming the template language into the input of another realiser or (part of) an NLG system, such as Grammatical Framework or SimpleNLG, among the many available systems. Each such system has its own way of specifying grammars or templates, with more or less grammar attached, and its own algorithms that interpret those specifications and thereby give them their intended meaning. For this reason, it may not be possible to map the Template Language losslessly to another template specification language or similar formalism. Rather, the idea is to devise a transformation of the template syntax that preserves the intended semantics as much as possible and introduces no contradiction between the source and the target.

Transformation to ToCT

We use the “Task ontology for Complex Templates” (ToCT)[1] for this illustration. An ontology has its own semantics (model-theoretic or graph-based, in the case of ontologies represented in the Web Ontology Language OWL), so an algorithmic interpretation is not needed. What is needed is a mapping from the elements of the template language, taking its constraints into account, to ToCT elements.

Mapping to ToCT elements

Comparing them informally first, we observe the following. Whereas the template language has text, slots, and functions, ToCT has words and slots as key elements and ignores functions and subtemplates. Dependency relations with roles in the template language have a generalized counterpart in ToCT: a more flexible (underspecified) ‘relies on’ relation that is, at present, partially inspired by but not based on dependency graphs in the UD/SUD sense. Conversely, ToCT has dedicated elements for word fragments, such as concords and affixes, to make things easier for Niger-Congo B languages; the template language does not have these as first-order elements. Each ToCT element is identified by an auto-generated part name that is user-modifiable and can therefore be made to match the identifiers of the template language. The languages one can select in the template language depend on what Wikidata supports. ToCT is more flexible here: it uses the Model for Language Annotation (MoLA) of Gillis-Webber, Tittel & Keet (2019),[2] which allows not only any ISO 639 code but also user-defined (spoken or dead) languages, dialects, and other lects.

Since the mapping is not an exact 1:1 correspondence, it cannot be lossless and it needs a human-in-the-loop, in either direction. A possible procedure from the template language into ToCT is as follows, where TL denotes the template language introduced above.

  1. Expand the template by unfolding each subtemplate into the main template such that there’s only one level for the template. This includes the possible process of subtemplate variant selection based on some precondition that may be specified.
  2. Expand the template by unfolding each subfunction that functions as a subtemplate into the main template such that there’s only one level.
  3. Transfer the language attribute of the TL template: either copy over the ISO 639 language code and use the system’s generated name for it, or add another name for it and indicate which is preferred and which is alternate.
  4. Index the order of TL elements from start to end, starting from 1.
  5. Group successive TL elements by word formation, omitting contractions due to phonological conditioning.
  6. Convert each immutable text string that is intended to become a word from TL to ToCT:
    1. If it is a unimorphic word, use the “Unimorphic word” element and transfer the index number.
    2. If it is a phrase (two or more words), use the “Phrase” element and transfer the index number.
  7. For each isolated slot that is intended to become a word or phrase, convert each TL slot to a ToCT slot, transferring the index number.
  8. For each combination of TL text and slots that is intended to become a polymorphic word:
    1. Determine what each element is: a concord, an affix, a root or stem (some text string), or a slot (i.e., something that will fetch content rather than linguistic data).
    2. Create a “Polymorphic word” and add those elements in sequence accordingly.
  9. Ensure that the ordering of elements in ToCT is the same as in TL.
  10. Resolve TL dependencies to ToCT’s reliesOn. (Since TL dependencies may be more specific than the reliesOn, this is not a lossless process. ToCT’s reliesOn will be refined in the near future.)
  11. Transfer punctuation from the TL template to ToCT.
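
As a rough illustration, steps 4, 6, 7, 9 and 11 can be sketched in Python for a template that has already been flattened (steps 1 and 2) and contains no polymorphic words. The names `Part` and `tl_to_parts` are hypothetical and not part of TL or ToCT. The word/phrase test is a simple whitespace heuristic, and steps 5, 8 and 10 are left to the human-in-the-loop.

```python
import re
from dataclasses import dataclass


@dataclass
class Part:
    index: int   # position in the TL template, starting from 1 (step 4)
    kind: str    # "Slot", "UnimorphicWord", "Phrase" or "Punctuation"
    value: str   # slot expression or literal text


def tl_to_parts(template: str) -> list[Part]:
    """Split a flat TL template into ToCT-style parts, keeping TL order."""
    found = []
    # Slots are the {...} spans; everything between them is immutable text.
    for chunk in re.split(r"(\{[^{}]*\})", template):
        chunk = chunk.strip()
        if not chunk:
            continue
        if chunk.startswith("{"):
            found.append(("Slot", chunk[1:-1]))          # step 7
            continue
        # Separate trailing sentence punctuation from the text (step 11).
        text = chunk.rstrip(".!?")
        punct = chunk[len(text):]
        text = text.strip()
        if text:
            # Assumption: one token is a unimorphic word, more is a phrase (step 6).
            kind = "Phrase" if len(text.split()) > 1 else "UnimorphicWord"
            found.append((kind, text))
        if punct:
            found.append(("Punctuation", punct))
    # Indices follow the TL order, so the ordering is preserved (step 9).
    return [Part(i, kind, value) for i, (kind, value) in enumerate(found, start=1)]


parts = tl_to_parts("{Person(Entity)} är {Age_in_years} år gammal.")
for part in parts:
    print(part.index, part.kind, part.value)
```

For the Swedish running example this yields five parts in TL order: a slot, a unimorphic word, a slot, a phrase, and punctuation.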

Examples in Turtle syntax

ToCT serialises template instantiations in Turtle (Terse RDF Triple Language), a W3C standard. (It could just as well have been RDF/XML, the official exchange syntax for OWL, or another serialisation.) Because Turtle is elaborate and intended for back-end processing rather than human consumption, we include only two examples: the simplest template, for Swedish, and the most elaborate one, for isiZulu.

The template syntax for the running example in Swedish:

Age_renderer_sv(Entity, Age_in_years): "{Person(Entity)} är {Age_in_years} år gammal."

The ToCT representation follows and is explained afterward (the initial @base and @prefix lines are Turtle housekeeping that can be ignored for now):

@base <http://people.cs.uct.ac.za/~zmahlaza/templates/owlsiz/> .
@prefix toct: <https://people.cs.uct.ac.za/~zmahlaza/ontologies/ToCT#> .
@prefix mola: <https://ontology.londisizwe.org/mola#> .
@prefix co: <http://purl.org/co/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix cao: <http://people.cs.uct.ac.za/~zmahlaza/ontologies/ConcordAnnotationOntology#> .

<ageSE> a toct:Template
    ; toct:supportsLanguage <lang>
    ; toct:hasFirstPart <Person>
    ; toct:hasLastPart <.>
    ; toct:hasPart <ar>, <ageYears>, <argammal> .

<lang> a mola:Language
    ; mola:isFamily <Swedish> .

<Person> a toct:Slot
    ; toct:hasLabel ""^^xsd:string
    ; toct:hasNextPart <ar> .

<ar> a toct:UnimorphicWord
    ; toct:hasValue "är"^^xsd:string
    ; toct:hasNextPart <ageYears> .

<ageYears> a toct:Slot
    ; toct:hasLabel ""^^xsd:string
    ; toct:hasNextPart <argammal> .

<argammal> a toct:Phrase
    ; toct:hasValue "år gammal"^^xsd:string
    ; toct:hasNextPart <.> .

<.> a toct:Punctuation
    ; toct:hasValue "."^^xsd:string .

ageSE is the chosen name for the ToCT template and lang records the language. Person is a slot, just as in TL; its label is left empty, since the content will be fetched from Wikidata. Its next part is ar, a unimorphic word whose value to be shown is är. Then ageYears is again a slot, as in TL, and the last multi-word immutable part, called argammal, is a phrase. The template ends with punctuation. There are no dependencies and no polymorphic words.
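
To make the role of hasFirstPart, hasNextPart and hasLastPart concrete, the following minimal Python sketch follows that chain and prints the resulting sentence. The dictionaries are a hand-copied simplification of the Turtle above, not a ToCT API. The function name `linearise` and the slot fillers "Anna" and "30" are hypothetical illustration values.

```python
# Simplified copy of the Swedish ToCT template: type, value and successor per part.
TEMPLATE = {"first": "Person", "last": "."}
PARTS = {
    "Person":   {"type": "Slot", "next": "ar"},
    "ar":       {"type": "UnimorphicWord", "value": "är", "next": "ageYears"},
    "ageYears": {"type": "Slot", "next": "argammal"},
    "argammal": {"type": "Phrase", "value": "år gammal", "next": "."},
    ".":        {"type": "Punctuation", "value": "."},
}


def linearise(template, parts, slot_values):
    """Follow hasFirstPart/hasNextPart up to hasLastPart and build the sentence."""
    tokens = []
    name = template["first"]
    while name is not None:
        part = parts[name]
        # Slots take their text from the supplied fillers, other parts from hasValue.
        text = slot_values[name] if part["type"] == "Slot" else part["value"]
        if part["type"] == "Punctuation" and tokens:
            tokens[-1] += text          # no space before punctuation
        else:
            tokens.append(text)
        if name == template["last"]:
            break
        name = part.get("next")
    return " ".join(tokens)


sentence = linearise(TEMPLATE, PARTS, {"Person": "Anna", "ageYears": "30"})
print(sentence)  # Anna är 30 år gammal.
```

In a real setting the slot fillers would come from Wikidata rather than from a hard-coded dictionary.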

The template syntax for the running example in isiZulu was as follows:

Age_renderer_zu(Entity, Age_in_years):
"{subj:Person(Entity)} {root:SubjectConcord()}na{Year(Age_in_years)}."
Year_zu(years):
"{root:Lexeme(L686326)} {concord:RelativeConcord()}{Copula()}{concord_1<nummod:NounConcord()}-{nummod:Cardinal(years)}"

The ToCT syntax (also explained below):

@base <http://people.cs.uct.ac.za/~zmahlaza/templates/owlsiz/> .
@prefix toct: <https://people.cs.uct.ac.za/~zmahlaza/ontologies/ToCT#> .
@prefix mola: <https://ontology.londisizwe.org/mola#> .
@prefix co: <http://purl.org/co/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix cao: <http://people.cs.uct.ac.za/~zmahlaza/ontologies/ConcordAnnotationOntology#> .

<PersonAge> a toct:Template
    ; toct:supportsLanguage <lang>
    ; toct:hasFirstPart <Person>
    ; toct:hasLastPart <punct>
    ; toct:hasPart <hasyears>, <isof> .

<lang> a mola:Language
    ; mola:isFamily <isiZulu> .

<Person> a toct:Slot
    ; toct:hasLabel ""^^xsd:string
    ; toct:hasNextPart <hasyears> .

<hasyears> a toct:PolymorphicWord
    ; toct:reliesOn <Person>
    ; toct:hasFirstPart <SC>
    ; toct:hasPart <na>
    ; toct:hasLastPart <unyaka>
    ; toct:hasNextPart <isof> .

<isof> a toct:PolymorphicWord
    ; toct:hasFirstPart <RC>
    ; toct:hasPart <cop>
    ; toct:hasPart <nounprefix>
    ; toct:hasPart <dash>
    ; toct:hasLastPart <yearno>
    ; toct:hasNextPart <punct> .

<punct> a toct:Punctuation
    ; toct:hasValue "."^^xsd:string .

<SC> a toct:Concord
    ; cao:hasConcordType <subjC>
    ; toct:reliesOn <Person>
    ; toct:hasLabel ""^^xsd:string .

<na> a toct:UnimorphicAffix
    ; toct:hasValue "na"^^xsd:string .

<unyaka> a toct:Slot
    ; toct:hasLabel ""^^xsd:string .

<RC> a toct:Concord
    ; cao:hasConcordType <relC>
    ; toct:reliesOn <unyaka>
    ; toct:hasLabel ""^^xsd:string .

<cop> a toct:Copula
    ; toct:hasLabel ""^^xsd:string .

<nounprefix> a toct:UnimorphicAffix
    ; toct:reliesOn <yearno>
    ; toct:hasValue ""^^xsd:string .

<dash> a toct:UnimorphicAffix
    ; toct:hasValue "-"^^xsd:string .

<yearno> a toct:Slot
    ; toct:hasLabel ""^^xsd:string .

The explanation of the Swedish template applies here as well, with a few additions. There are more element types, such as Concord and PolymorphicWord. The principle of notation for a polymorphic word is the same as for the template: there is an ordering of parts, and each part has its own entry that provides further detail. For instance, the isof polymorphic word has a part RC whose entry further below states that it is of type Concord, which type of concord it is (with hasConcordType), and that it reliesOn the part identified as unyaka. (unyaka is a slot, for fetching the lexeme from Wikidata, but it could also have been a root, in which case it would be assumed to be pluralised at a later stage if needed.) This is equivalent to the “{root:Lexeme(L686326)} {concord:RelativeConcord()}” part of the TL template.
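
The reliesOn statements in the isiZulu example can be read as a small dependency graph. The Python sketch below assumes, purely for illustration and not as documented ToCT behaviour, that a part is realised only after the part it relies on. It uses the standard-library `graphlib` module to compute one such order. The dictionary is hand-copied from the Turtle above, and the function name `realisation_order` is hypothetical.

```python
from graphlib import TopologicalSorter

# part -> set of parts it reliesOn, copied from the isiZulu Turtle example.
RELIES_ON = {
    "hasyears": {"Person"},
    "SC": {"Person"},
    "RC": {"unyaka"},
    "nounprefix": {"yearno"},
}


def realisation_order(relies_on):
    """Return the parts in an order where every dependency comes first."""
    # TopologicalSorter treats the mapped values as predecessors and
    # raises graphlib.CycleError if the reliesOn statements are circular.
    return list(TopologicalSorter(relies_on).static_order())


order = realisation_order(RELIES_ON)
print(order)
```

Several valid orders exist, so the exact output may vary. What holds in each is that Person precedes SC and hasyears, unyaka precedes RC, and yearno precedes nounprefix.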

Transformation to <add system here>

...

References

  1. Mahlaza, Zola; Keet, C. Maria (2021), AdeebNqo/ToCT, doi:10.5281/zenodo.5074931 
  2. Gillis-Webber, Frances; Tittel, Sabine; Keet, C. Maria (2019), A Model for Language Annotations on the Web, doi:10.1007/978-3-030-21395-4_1