Talk:Abstract Wikipedia/Natural language generation system architecture proposal

Please feel free to add your comments and thoughts about this proposal below. AGutman-WMF (talk) 13:54, 27 May 2022 (UTC)

Commas in example

In the example Age constructor function, the last two "role" lines have no comma in them. Is this a mistake? Strobilomyces (talk) 18:09, 28 May 2022 (UTC)

Yes, this was an omission, which I have now corrected. Thanks for noticing!
I should mention, however, that the example given is an illustration. I'm working on providing a more concrete specification of the format. AGutman-WMF (talk) 10:29, 30 May 2022 (UTC)

Formatting clean up step

Can some examples be provided for step 6 of the architecture overview? It states that punctuation and capitalisation can be dealt with in a language-agnostic way, but many examples I can think of for punctuation and capitalisation have language-specific features. Sillyfolkboy (talk) 23:12, 28 May 2022 (UTC)

Thanks for the question! I am planning to follow up on this in a subsequent, more detailed design document. My wording might be a bit misleading, but my intention was that the relevant module can be language-agnostic given the annotated output of the previous module. I'll clarify this in the text.
For the time being, let me briefly describe how I envisage this working:
Spacing
Spacing can be specified directly in the templatic renderers (as part of static strings). In most cases, however, spacing is not specified explicitly, but rather the spacing module will add spaces between parts of the template, with some exceptions:
  • Some languages (e.g. Thai) don't use spaces between words. This information can be part of the annotations passed on to the module.
  • Affixed words (or clitics) as well as punctuation marks which are attached without spaces to preceding/following elements will be annotated as such.
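
The spacing rules above could be sketched roughly as follows. This is only an illustration under assumptions: the `Token` class, the `attach_left`/`attach_right` annotations and the `use_spaces` flag are hypothetical names, not part of the proposal.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    attach_left: bool = False   # no space before this token (e.g. ",", a suffix, an enclitic)
    attach_right: bool = False  # no space after this token (e.g. "(", a prefix, a proclitic)


def join_tokens(tokens, use_spaces=True):
    """Concatenate tokens, adding a space unless an annotation forbids it.

    use_spaces=False models languages such as Thai that do not
    separate words with spaces.
    """
    out = []
    for i, tok in enumerate(tokens):
        needs_space = (
            i > 0
            and use_spaces
            and not tok.attach_left
            and not tokens[i - 1].attach_right
        )
        if needs_space:
            out.append(" ")
        out.append(tok.text)
    return "".join(out)


french = [Token("l'", attach_right=True), Token("amie"), Token("arrive"),
          Token(".", attach_left=True)]
print(join_tokens(french))
```

The function itself contains no language-specific logic; everything it needs arrives as annotations on the tokens.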
Capitalization
In scripts where capitalization is used (mostly the Latin script), inherent word capitalization (e.g. in proper nouns) would be included in the templatic renderers (or even better, in the name fetched from Wikidata), so the real issue for the formatting module is to take care of sentence-initial capitalization. This can be achieved by having sentence-final punctuation marks be annotated as such.
Of course, the actual choice of the capital letter depends on the language itself (e.g. in Turkish the capital letter of i is İ). The formatting module would get the language code from the previous modules and could use the ICU library to do the capitalization in a language-sensitive way. The result would be language-sensitive, but our code would be language-agnostic.
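
A minimal sketch of sentence-initial capitalization driven by annotations might look like this. It is an assumption-laden illustration: the small `SPECIAL_UPPER` table is a hand-written stand-in for what ICU would provide, and the `(text, ends_sentence)` token shape is hypothetical.

```python
# Hypothetical stand-in for ICU's language-sensitive case mapping.
SPECIAL_UPPER = {
    "tr": {"i": "İ", "ı": "I"},
    "az": {"i": "İ", "ı": "I"},
}


def capitalize_first(text, lang):
    """Uppercase the first character, honouring per-language exceptions."""
    if not text:
        return text
    first = text[0]
    upper = SPECIAL_UPPER.get(lang, {}).get(first, first.upper())
    return upper + text[1:]


def capitalize_sentences(tokens, lang):
    """tokens is a list of (text, ends_sentence) pairs.

    The first token, and every token that follows one annotated as
    sentence-final, is capitalized.
    """
    result = []
    at_sentence_start = True
    for text, ends_sentence in tokens:
        result.append(capitalize_first(text, lang) if at_sentence_start else text)
        at_sentence_start = ends_sentence
    return result


print(capitalize_first("istanbul", "tr"))
print(capitalize_first("istanbul", "en"))
```

The language code is only passed through to the case-mapping step, so the surrounding logic stays the same for every language.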
Punctuation
Punctuation marks would also normally be specified in the templatic renderers themselves, either as text or as function invocations (if we want the choice of punctuation to be dynamic, for instance). As mentioned above, punctuation marks would also include annotations regarding spacing and capitalization.
The formatting module will mostly take care of removing stray or duplicate punctuation marks which might arise due to the design of the templatic renderer.
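
As a rough illustration of the clean-up described above, the following sketch drops a stray mark at the start of the output and collapses adjacent marks into one. The `STRENGTH` ranking and the rule that the stronger mark survives are assumptions made for the example, not something the proposal specifies.

```python
# Hypothetical ranking: when two marks end up adjacent, the stronger one survives.
STRENGTH = {",": 1, ";": 2, ":": 2, ".": 3, "!": 3, "?": 3}


def clean_punctuation(tokens):
    """Remove stray leading marks and collapse runs of punctuation tokens."""
    cleaned = []
    for tok in tokens:
        if tok in STRENGTH:
            if not cleaned:
                continue  # stray mark before any word: drop it
            if cleaned[-1] in STRENGTH:
                # Adjacent marks: keep the stronger one (the earlier one on a tie).
                if STRENGTH[tok] > STRENGTH[cleaned[-1]]:
                    cleaned[-1] = tok
                continue
        cleaned.append(tok)
    return cleaned


print(clean_punctuation(["Paris", ",", ",", "France", ",", "."]))
```

Such duplicates can arise, for example, when an optional template slot between two commas turns out to be empty.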
AGutman-WMF (talk) 12:59, 30 May 2022 (UTC)
@AGutman-WMF: Yep, makes total sense now. I'm glad this was just a case of hidden extra thinking rather than anything missed :) Sillyfolkboy (talk) 17:59, 6 June 2022 (UTC)

Placeholder templates

I am currently interested in block-based visual programming languages. I have written a program that serves as an expanded codification feature, turning blocks with gaps for the variables into code. I used Snap! to define the blocks, and the program is written in R. The code for that is here. It reads an XML file with the definition of the blocks and outputs the code as defined in a CSV file, where the blocks are mapped to the code. Such a system could be interesting for frequently used standard sentences. If a sentence with the same structure has been used before within Abstract text, the program that generates the output can look up the equivalent block and execute the code needed for the output language. So it is a kind of collection of frequently used combinations, so that they can be processed faster. Currently I think that such a system could cover many of the frequently used sentences. Adding new sentences requires a mapping between the words mentioned and the objects used within the sentence. When Wikidata is the source, the objects are the Q-items used. This is my current view of what the templates could look like. Do you think it is possible to cover most of what is used in typical texts with, as an estimate, around 100 differently structured sentences?--Hogü-456 (talk) 12:25, 29 May 2022 (UTC)

Sounds interesting! The templates would indeed be tied to constructors which would represent common types of sentences (or claims) which appear in Wikipedia. We don't know exactly how many constructors would be needed, but given efforts such as FrameNet, a reasonable estimate is that around 1000 constructors (paired with templates) would be needed to cover a substantial part of Wikipedia. AGutman-WMF (talk) 13:22, 30 May 2022 (UTC)

Unify operator

@AGutman-WMF It is very good to see that you are advancing with the multilingual part of Abstract Wikipedia and I will be excited to see how it works out.

Please could you provide a link to a definition or introduction to the Unify operator? Am I right in thinking that it does unification in the sense of this WP page?

I can see that grammatical features like number and gender need to be made to agree in the lemma tree, but I think in practice it should always be clear what depends on what, so I think Unify is overkill and it would be much simpler just to use assignment. The direction of propagation of these features is known once the template is known. For instance, in Germanic languages it only makes sense to determine the number of a verb from the number of the subject, never the other way around. Strobilomyces (talk) 15:28, 30 May 2022 (UTC)

Indeed, the page you link to gives a very broad and abstract definition of unification. In linguistics, unification is mostly used as a means to construct (and resolve) representations of the linguistic features of a word, phrase, sentence etc. In this sense, you can find a good discussion of unification in Syntactic Theory: A formal introduction. Online, you can find articles by Peter Hellwig which discuss the use of unification in linguistics (1986:§7, 2003:§5).
As for your second remark, unification is more general than assignment exactly because you don't need to bother about the order in which it is applied. Let's imagine you have a template telling us about the children of someone, of the general form The (son|sons|daughter|daughters|child|children) of X (is|are) <list>. In this case, the gender and number attributes of the subject itself are determined by the complement of the verb, so the order of propagation of features (along the syntax tree) could be from the complement to the verb to the subject, rather than the other way around. Of course, given the template one may deduce this and handle the assignments in a particular way here, but using unification allows you to abstract away from these considerations. Or, think of a Swedish noun phrase like det stora huset ("the big house"). In this case, there is a definite feature that is marked on the article, the adjective and the noun. In some cases, the definite feature may originate in the noun itself (e.g. it is mentioned in a previous context) and be propagated to the article (causing a selection of a definite article form), but in other cases, the template may specify a definite article and the noun gets the definite inflection because of that. Here, too, using unification allows you to abstract away from this question and have a single rule applied in both cases. AGutman-WMF (talk) 15:49, 30 May 2022 (UTC)
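
To illustrate the order-independence point, here is a minimal sketch of unification over flat feature sets. It is a deliberate simplification: unification-based grammars normally work with nested, re-entrant feature structures, and the function names and feature labels below are hypothetical.

```python
def unify(a, b):
    """Merge two flat feature dicts; return None if any feature conflicts."""
    merged = dict(a)
    for feature, value in b.items():
        if feature in merged and merged[feature] != value:
            return None  # conflicting values: unification fails
        merged[feature] = value
    return merged


def unify_all(structures):
    """Unify a sequence of feature dicts, stopping at the first conflict."""
    result = {}
    for s in structures:
        result = unify(result, s)
        if result is None:
            return None
    return result


# Swedish "det stora huset": here the article contributes definiteness,
# the noun contributes gender and number, the adjective contributes nothing.
article = {"definiteness": "definite"}
adjective = {}
noun = {"gender": "neuter", "number": "singular"}

forward = unify_all([article, adjective, noun])
backward = unify_all([noun, adjective, article])
print(forward == backward)
```

Whichever element the definiteness originates from, the shared result is the same, so a single agreement rule covers both cases. With plain assignment, the direction of copying would have to be decided per template.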
Thank you for your prompt answer. I am not sure that I am ready to purchase the book on syntactic theory at present (as it would doubtless lead to other purchases), but I will think about it. I will look at the other links. Strobilomyces (talk) 21:11, 30 May 2022 (UTC)
You probably don't need to buy the book just for this reason.
I've looked into the copy of the book I have at home and found the following remark (p. 83): "For an elementary discussion of the formal properties of unification and its use in grammatical description, see Shieber 1986." The latter is a textbook called An Introduction to Unification-Based Approaches to Grammar, available online. AGutman-WMF (talk) 09:22, 31 May 2022 (UTC)
Thank you. Strobilomyces (talk) 12:09, 31 May 2022 (UTC)