Abstract Wikipedia/Natural language generation system architecture proposal


Proposal by Ariel Gutman

This document describes a proposed architecture for a natural language generation (NLG) system for Abstract Wikipedia. When designing the architecture of an NLG system, the following considerations should be taken into account:

Modularity: the system should be modular, meaning that the various aspects of NLG (for example, morphosyntactic and phonotactic rules) can be modified independently of one another.
Lexicality: the system should be able both to fetch lexical data (kept separate from code) and to rely on productive language rules to generate lexical data on the fly (for example, adding the suffix -s to form plural nouns in English).
Recursivity: due to the compositional and recursive nature of most languages,^[1] an effective NLG system should be able to call itself recursively.
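
The lexicality requirement can be illustrated with a minimal Python sketch. The `LEXICON` dictionary stands in for lexical data stored outside the code (for instance Wikidata lexemes), and `regular_plural` is a deliberately naive rule. Both names are hypothetical and not part of the proposal.

```python
# Hypothetical stand-in for lexical data kept outside the code
# (e.g. irregular forms stored as Wikidata lexemes).
LEXICON = {
    "child": {"plural": "children"},
    "mouse": {"plural": "mice"},
}


def regular_plural(lemma: str) -> str:
    """Naive productive rule: append -s (ignores -es, -ies and so on)."""
    return lemma + "s"


def plural(lemma: str) -> str:
    """Prefer stored lexical data; fall back to the productive rule."""
    entry = LEXICON.get(lemma, {})
    return entry.get("plural") or regular_plural(lemma)
```

Here `plural("child")` is answered from the stored data, while `plural("year")` falls back to the rule.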

In the context of Abstract Wikipedia, a further consideration should be taken into account:

Extensibility: the system should be extensible by linguists and technical contributors as well as by non-technical, non-expert contributors, each working on different parts of the system.

Given the above constraints, it seems reasonable to assume that a single Wikifunctions (= WF) function cannot effectively capture the complexity of such a modular NLG system. Thus, several such functions need to be involved, each responsible for a different step of the NLG pipeline.

In the current design of WF, individual functions cannot:

Call other WF functions.
Fetch data from external sources such as Wikidata.
Change any global state of the system.

To overcome these limitations, this document proposes an NLG pipeline that would be run by the WF orchestrator, which is not subject to these restrictions. In addition, to enable non-technical contributors to participate, it is proposed to create an in-house templating language, which could be run by a dedicated WF evaluator.

An alternative approach would be to lift the design limitations of the WF evaluator, so that the entire NLG pipeline could be wrapped in a single WF function (which would then call other WF functions). While this approach would change some implementation aspects of the system (for example, the orchestration of the pipeline would become modifiable by WF contributors), most of the architectural concepts would remain the same.

A brief comparison with other suggested approaches is given at the end of the document.

Architecture overview

As explained above, the full NLG pipeline cannot be encapsulated within a single Wikifunctions (= WF) function, but must instead be run by the WF orchestrator, which would allow fetching data from external sources (in particular Wikidata), invoking different WF functions (defined by contributors) and keeping the necessary state while doing so. The envisaged architecture is presented in the following diagram, where the dark blue shapes are elements which would be created by contributors to Wikifunctions (rectangles) or Wikidata (rounded rectangles), while the light blue elements represent functions or data living within the WF orchestrator, and thus not directly amenable to community contribution.

Let’s detail the steps:

Given a constructor type, a specific renderer is selected,^[2] and the data contained in the given constructor is passed to the renderer as its function arguments.
The renderer is basically a template: a combination of static text, and slots which can be filled with the Renderer’s arguments, lexemes from Wikidata, or the output of other renderers. Templates are relatively easy to understand and write, and thus the authoring of renderers will be accessible to non-technical contributors.
The output of the renderer is a dependency syntax tree (using for instance Universal Dependencies (UD) or Surface-Syntactic Universal Dependencies (SUD) formalisms)^[3] in which the nodes are non-inflected lexemes (identified by their lemmas), augmented with some morphological constraints. In practice the tree doesn’t need to be fully specified; in particular, static text doesn’t necessarily need to be part of the tree.
Relying on a language-specific grammar specification, the morphological constraints coupled with the structure of the syntactic tree allow the inflection of the lemmas, according to the lexical data present in Wikidata, or using the inflectional tables of the grammar specification. The output of this step is a linear sequence of text, minimally annotated with part-of-speech information (i.e. whether a word represents a noun, a verb, a preposition etc.).
At this step, phonotactic constraints are applied to handle language-specific sandhi phenomena. These can include the selection of contextual forms (e.g. a/an in English) or the contraction/crasis of adjacent forms (e.g. French de + le = du).
As a final clean-up step, spacing, capitalization and punctuation may need to be adjusted in order to render the final text to be stored in a Wikipedia article. This step can be modeled in a language-agnostic way, by using (language-dependent) annotations from the previous steps.
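
The last two steps can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the `Token` representation, the function names and the spelling-based a/an rule are hypothetical and are not part of the proposal.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag)

VOWEL_LETTERS = frozenset("aeiou")


def apply_english_sandhi(tokens: List[Token]) -> List[Token]:
    """Pick 'a' or 'an' from the first letter of the following word.

    Spelling-based approximation; the real rule depends on pronunciation
    (e.g. 'an hour', 'a university').
    """
    out: List[Token] = []
    for i, (form, pos) in enumerate(tokens):
        if pos == "DET" and form in ("a", "an") and i + 1 < len(tokens):
            next_form = tokens[i + 1][0]
            form = "an" if next_form[:1].lower() in VOWEL_LETTERS else "a"
        out.append((form, pos))
    return out


def apply_french_sandhi(tokens: List[Token]) -> List[Token]:
    """Contract adjacent 'de' + 'le' into 'du'."""
    out: List[Token] = []
    for form, pos in tokens:
        if out and out[-1][0] == "de" and form == "le":
            out[-1] = ("du", "ADP")
        else:
            out.append((form, pos))
    return out


def clean_up(tokens: List[Token]) -> str:
    """Final step: fix spacing, capitalization and punctuation."""
    text = ""
    for form, pos in tokens:
        # Punctuation attaches to the previous token without a space.
        text += form if pos == "PUNCT" or not text else " " + form
    return text[:1].upper() + text[1:]
```

The clean-up function itself contains no language-specific logic: it relies only on the part-of-speech annotations produced by the earlier steps.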

In the above architecture, there are three components which need to be curated by community members:

Templatic renderers - these make up the bulk of the needed work, as every constructor needs one templatic renderer per language (though re-use of renderers for parts of sentences is possible). Note that the term Renderer is used here in a narrower sense than in Architecture for a Multilingual Wikipedia. In the latter, the term Renderer refers to an end-to-end data-to-text function, while here we use the term Renderer to refer to a specific component of the NLG pipeline, namely a template. This is no coincidence, since in the above architecture, the other parts of the pipeline are relatively fixed, and don’t need constant curation by community members.
Grammar specifications - these would have to specify the relevant morphological features needed for each language, their hierarchy and how these manifest themselves via dependency relations. These specifications may either be stored as data in Wikidata, or as functions in Wikifunctions (to be decided). It is probable that the creation and curation of these grammars will require substantial linguistic and technical knowledge, but since they are created once per (human) language, this is deemed acceptable.
Wikidata lexemes - these will be curated as they are today, but it would be important that the features they use are in line with the grammar specifications of each language.

Structure of templates

Since the bulk of needed work by community contributors would be the creation of templatic renderers, it is important to make this task as easy as possible, and in particular avoid requiring any coding experience.

Similar to the Composition “language” in Wikifunctions, we can develop an in-house templating language.^[4] The templating language should allow specifying a linguistic tree (with UD annotations) over three types of arguments:^[5]

Static text
Terminal functions fetching lemmas from Wikidata, or creating lemmas on the fly from other arguments (e.g. numbers^[6]).
Other renderers

The templating language will have a dedicated evaluator module, called by the WF orchestrator. The latter will be responsible for passing the output through the various modules of the NLG pipeline outlined above.
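
To make the three slot types concrete, here is a minimal Python sketch of how an evaluator might walk a template. All class, function and template names (`StaticText`, `FunctionCall`, `RendererCall`, `Label`, `Age_phrase_en` and so on) are assumptions made for illustration, and the sketch omits the dependency annotations that a real template would carry.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class StaticText:
    value: str


@dataclass
class FunctionCall:
    function: str  # name of a terminal function
    argument: str  # constructor field passed to that function


@dataclass
class RendererCall:
    renderer: str  # name of another renderer
    argument: str  # constructor field holding the nested data


Slot = Union[StaticText, FunctionCall, RendererCall]

# Hypothetical registries of terminal functions and templates.
FUNCTIONS: Dict[str, Callable[[object], str]] = {
    "Cardinal_number": lambda n: str(n),
    "Label": lambda label: str(label),
}
TEMPLATES: Dict[str, List[Slot]] = {
    "Age_phrase_en": [
        FunctionCall("Cardinal_number", "Age_in_years"),
        StaticText("years old"),
    ],
    "Age_en": [
        FunctionCall("Label", "Entity"),
        StaticText("is"),
        RendererCall("Age_phrase_en", "Age"),
    ],
}


def evaluate(renderer: str, data: dict) -> List[str]:
    """Fill each slot in order; nested renderers are evaluated recursively."""
    out: List[str] = []
    for slot in TEMPLATES[renderer]:
        if isinstance(slot, StaticText):
            out.append(slot.value)
        elif isinstance(slot, FunctionCall):
            out.append(FUNCTIONS[slot.function](data[slot.argument]))
        else:
            out.extend(evaluate(slot.renderer, data[slot.argument]))
    return out
```

The recursive call in the last branch is what allows a renderer to embed the output of other renderers.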

Example

Let’s assume we have a simple Constructor conveying the age of a person:^[7]

Age(
  Entity: Malala Yousafzai (Q32732)
  Age_in_years: 24
)

To render such a Constructor in English, we will use a templatic notation similar to the following (being a Z14/Implementation type):

{
  "type": "implementation",
  "implements": "Age_renderer_en",
  "template": {
    "parts": [
      {
        "role": "subject",  # grammatical subject
        "type": "function call",
        "function": "Resolve_Lexeme",
        "lexeme": {
          "reference": "Entity"
        }
      },
      {
        "role": "root",  # root of the clause
        "type": "function call",
        "function": "Resolve_Lexeme",
        "lexeme": {
          "value": "be"  # replace with L-id
        }
      },
      {
        "role": "num",  # numerical modifier
        "of": 4,  # Part 4 (“year”)
        "type": "function call",
        "function": "Cardinal_number",
        "number": {
          "reference": "Age_in_years"
        }
      },
      {
        "role": "npadvmod",
        "of": 5,  # Part 5 (“old”)
        "type": "function call",
        "function": "Resolve_Lexeme",
        "lexeme": {
          "value": "year"  # replace with L-id
        }
      },
      {
        "role": "acomp",
        "type": "string",
        "value": "old"
      }
    ]
  }
}

Some of the syntactic roles (npadvmod, acomp) have in fact no agreement effect, so one can leave them out.
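
As a rough illustration of what the roles in the Age template achieve, the following Python sketch hard-codes the two agreement effects involved: the verb agrees with the subject, and "year" agrees with the numeral. The `FORMS` table is a toy stand-in for Wikidata lexeme forms, and the function name and feature encoding are hypothetical.

```python
# Toy stand-in for the inflected forms of Wikidata lexemes (hypothetical data).
FORMS = {
    "be": {("singular", 3): "is", ("plural", 3): "are"},
    "year": {"singular": "year", "plural": "years"},
}


def render_age_en(entity_label: str, age_in_years: int) -> str:
    """Fill the five parts of the Age template in their written order."""
    # Part 1 (subject): a person's name, treated as third-person singular.
    subject = {"text": entity_label, "number": "singular", "person": 3}
    # Part 2 (root): "be" agrees with the subject in number and person.
    verb = FORMS["be"][(subject["number"], subject["person"])]
    # Part 3 (num): the cardinal determines the number of the noun it modifies.
    noun_number = "singular" if age_in_years == 1 else "plural"
    # Part 4 (npadvmod): "year", inflected according to the numeral.
    year = FORMS["year"][noun_number]
    # Part 5 (acomp): static text with no agreement effect.
    return f"{subject['text']} {verb} {age_in_years} {year} old"
```

In the proposed architecture these effects would follow from the grammar specification rather than from hand-written code; the sketch only shows the expected behavior.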

Structure of grammar

The grammar needs to include the following information:

Which parts of speech the language has.
Which grammatical features are appropriate for each part of speech.
(Possibly) a type hierarchy of the features.
How grammatical relations (i.e. dependency relations) interact with grammatical features and parts of speech.

Note that the first points can be inferred from the Wikidata lexemes available for a given language, but it would be useful to make them explicit as part of a grammar definition, which would also enforce/validate the Wikidata lexeme definitions.^[8] One could write such a validator per language as a WF function, which would then run on the Wikidata lexemes to mark if they are correctly annotated according to the language's schema.
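
A per-language validator of this kind could look roughly like the following Python sketch. The schema layout, the feature names and the function name are assumptions chosen for illustration; an actual schema would be derived from the grammar specification.

```python
from typing import Dict, List, Set

# Hypothetical schema: for each part of speech, the feature categories
# and the values each category may take.
SCHEMA_EN: Dict[str, Dict[str, Set[str]]] = {
    "verb": {
        "number": {"singular", "plural"},
        "person": {"first-person", "second-person", "third-person"},
        "tense": {"present", "past"},
    },
    "noun": {"number": {"singular", "plural"}},
}


def validate_form(pos: str, features: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the form conforms."""
    if pos not in SCHEMA_EN:
        return [f"unknown part of speech: {pos}"]
    allowed = SCHEMA_EN[pos]
    problems: List[str] = []
    seen: Set[str] = set()
    for feature in features:
        # Find the category (if any) that this feature value belongs to.
        categories = [c for c, values in allowed.items() if feature in values]
        if not categories:
            problems.append(f"feature not in schema: {feature}")
        elif categories[0] in seen:
            problems.append(f"conflicting values for: {categories[0]}")
        else:
            seen.add(categories[0])
    return problems
```

With such a schema, a form annotated "simple present" would be flagged because only "present" is declared, which is the kind of inconsistency the validator is meant to surface.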

As for the grammatical relations, these can be encoded either as data in Wikidata or as functions in WF. Since dependency relations can be implemented as the unification of the grammatical features of their nodes, one could implement each relation as a Composition WF function, using a Unify operator as a built-in function. For instance, a "subj" relation for English would be implemented as follows (using shorthand notation):

subj_en(noun, verb): 
    Unify(noun.pos, NOUN);  # Validate types
    Unify(verb.pos, VERB);
    Unify(noun.number, verb.number);
    Unify(noun.person, verb.person);
    Unify(noun.case, NOMINATIVE);

Note that, implemented like this, the subj function is not a pure function, since it modifies its input arguments (and in fact the return value is not used, unless a unification error occurs). To keep things simple, this special behavior would need to be supported by the function evaluator.

One may want to bundle together features which get unified together. For instance, if we observe that number and person are often unified together, we may define a sub-function such as the following:

agr(left, right):
    Unify(left.number, right.number);
    Unify(left.person, right.person);

Then we can redefine the subj relation above as follows:

subj_en(noun, verb): 
    Unify(noun.pos, NOUN);  # Validate types
    Unify(verb.pos, VERB);
    agr(noun, verb);
    Unify(noun.case, NOMINATIVE);
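
For readers who want to experiment, the shorthand above can be approximated in Python as follows. This is a sketch under stated assumptions, not the proposed WF implementation: the `Feature` and `Node` classes, the `unify` function and the string feature values are all hypothetical. Unset features are linked together so that a value assigned later is shared by both sides.

```python
from typing import Optional, Union


class UnificationError(Exception):
    """Raised when two features hold incompatible values."""


class Feature:
    """A mutable feature cell; cells can be linked to one another."""

    def __init__(self, value: Optional[str] = None):
        self._value = value
        self._link: Optional["Feature"] = None

    def _root(self) -> "Feature":
        cell = self
        while cell._link is not None:
            cell = cell._link
        return cell

    @property
    def value(self) -> Optional[str]:
        return self._root()._value


def unify(left: Feature, right: Union[Feature, str]) -> None:
    """Make both sides share one value, or raise if they conflict."""
    if isinstance(right, str):
        right = Feature(right)  # constants are wrapped in a fresh cell
    a, b = left._root(), right._root()
    if a is b:
        return
    if a._value is not None and b._value is not None and a._value != b._value:
        raise UnificationError(f"{a._value} != {b._value}")
    if b._value is None:
        b._value = a._value  # keep whichever value is already known
    a._link = b  # from now on both cells resolve to the same root


class Node:
    def __init__(self, lemma: str, pos: str, **features: str):
        self.lemma = lemma
        self.pos = Feature(pos)
        self.number = Feature(features.get("number"))
        self.person = Feature(features.get("person"))
        self.case = Feature(features.get("case"))


def agr(left: Node, right: Node) -> None:
    unify(left.number, right.number)
    unify(left.person, right.person)


def subj_en(noun: Node, verb: Node) -> None:
    unify(noun.pos, "NOUN")  # validate types
    unify(verb.pos, "VERB")
    agr(noun, verb)
    unify(noun.case, "NOMINATIVE")
```

As in the shorthand, `subj_en` returns nothing and works purely by modifying its arguments: after the call, the verb node carries the number and person of its subject.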

Modular design of grammars and renderers

It is often the case that languages from the same language family exhibit some grammatical and structural similarities. One may take advantage of this phenomenon by defining a hierarchy of languages and language-families,^[9] and allow the NLG system to use dynamic dispatch to the most concrete implementation of a (sub)-renderer or a (sub)-relation.
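
The dispatch mechanism can be sketched in Python as a lookup that climbs the language hierarchy until it finds an implementation. The `PARENT` table uses ISO 639-5 group codes (roa, gem, ine) for illustration, while the registry, the `noun_adj` sub-renderer and the word-order rules are hypothetical and deliberately simplified (French, for example, also has pre-nominal adjectives).

```python
from typing import Callable, Dict, Optional, Tuple

# Each language or family code maps to its parent (None at the top).
PARENT: Dict[str, Optional[str]] = {
    "fr": "roa",   # French -> Romance languages
    "it": "roa",   # Italian -> Romance languages
    "en": "gem",   # English -> Germanic languages
    "roa": "ine",  # Romance -> Indo-European languages
    "gem": "ine",  # Germanic -> Indo-European languages
    "ine": None,
}

REGISTRY: Dict[Tuple[str, str], Callable[..., str]] = {}


def register(name: str, language: str):
    """Decorator storing an implementation for a language or family."""
    def wrap(func: Callable[..., str]) -> Callable[..., str]:
        REGISTRY[(name, language)] = func
        return func
    return wrap


def dispatch(name: str, language: str) -> Callable[..., str]:
    """Return the most specific implementation available for a language."""
    code: Optional[str] = language
    while code is not None:
        if (name, code) in REGISTRY:
            return REGISTRY[(name, code)]
        code = PARENT.get(code)
    raise LookupError(f"no implementation of {name} for {language}")


@register("noun_adj", "roa")
def noun_adj_romance(noun: str, adjective: str) -> str:
    return f"{noun} {adjective}"  # simplified: adjective after the noun


@register("noun_adj", "gem")
def noun_adj_germanic(noun: str, adjective: str) -> str:
    return f"{adjective} {noun}"  # adjective before the noun
```

With this setup, French and Italian share the Romance implementation without defining their own, and a language-specific override can be added later by registering it under the language code.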

Other approaches

To date, I'm aware of two other systems that have been proposed to handle the NLG of Abstract Wikipedia.

Grammatical Framework (GF) is an established functional programming language intended to support multilingual natural language generation and understanding (see newsletter description). It has a thriving community of computer scientists, linguists and other enthusiasts who contribute to it.
Ninai/Udiron is a Python-based NLG system built by community member Mahir Morshed, which uses lexeme data from Wikidata and combines them using UD trees. The system has been built with the Abstract Wikipedia project in mind. Some interesting examples of constructors and how they are rendered can be found in the Ninai demonstrations.

While the two systems are different, they can be contrasted with the proposal outlined in this document along similar axes:

Both systems are geared toward converting relatively abstract & compositional semantic representations into grammatical structure and then text.
They require mastering some programming skills, be it a domain-specific language (GF) or a general programming language (Python).
The ordering of the words in the output text is determined by the entire NLG pipeline (e.g. adding a Question operator could change the word order in English).
Insofar as the grammar definitions are correct, the output is guaranteed to be grammatical.

The proposal outlined in this document, on the other hand, is specifically intended to make it as easy as possible for people without prior technical knowledge to make contributions. This implies the following:

It can work with concrete, non-compositional semantic representations (such as the Age example above). This, however, does not exclude handling more abstract representations.
At the entry level, almost no programming skills are required to write templatic renderers. Knowledge of linguistics (in particular dependency annotations) can be useful to achieve grammatical output, and is necessary in order to write the grammar specifications themselves.
The ordering of words is determined by the templates themselves, and is not changed later in the pipeline.
Output can be ungrammatical, if a template has not been designed correctly.

Footnotes

[1] The question of whether recursion is found in all languages has been hotly debated in recent years.
[2] It may be useful to allow rendering a constructor either nominally (e.g. “Marie’s marriage to Pierre”) or verbally (“Marie got married to Pierre”). In that case, more than one renderer per constructor would be needed.
[3] The SUD formalism is simpler and possibly more adequate for NLG tasks. Osborne & Gerdes (2019) provide a discussion of the shortcomings of UD. See also https://surfacesyntacticud.github.io/conversions/ for a comparison of the two formalisms. In either case, we may need to extend the set of dependency relations in order to capture some patterns required for NLG, such as pronominal cross-reference.
[4] The templating language could be designed to be "syntactic sugar" above the Composition language, and thus it could probably be run by the same evaluator as the Composition language.
[5] See the poster "Using Dependency Grammars in guiding Natural Language Generation" (A. Gutman, A. Ivanov, J. Kirchner, 2019) as well as the corresponding working paper.
[6] One can use Unicode’s Common Locale Data Repository (CLDR) library to render cardinals and ordinals in different languages, as well as other data types such as dates.
[7] In practice the age should probably be calculated from the birthdate, but for the sake of example, it is specified in the constructor. We may moreover envisage a dynamic constructor in which part of the data is calculated on the fly.
[8] Currently there is no consistency in the annotation of lexemes, even within a single language. For example, the form "has" is annotated as "singular, third-person, simple present" while the form "is" is annotated as "third-person singular, indicative present".
[9] Depending on the needed granularity, one may use the existing hierarchical codes as defined in the ISO 639-5 standard, or alternatively rely on the existing language hierarchy defined in MediaWiki.
