Abstract Wikipedia/Updates/2022-11-04

Abstract Wikipedia Updates

One Ring, or a thousand flowers?

One of the Abstract Wikipedia workstreams is focused on the natural language generation tasks that will be necessary for creating and maintaining Wikipedia articles in hundreds of languages. Unlike the other workstreams, this work is not focused on the immediate future and launch of Wikifunctions, but explores the next steps necessary once Wikifunctions is available and connected to the other Wikimedia projects, particularly Wikidata and Wikipedia.

In previous newsletters we have talked about some of the approaches and work around natural language generation for Abstract Wikipedia: Mahir Morshed talked about Ninai and Udiron; we talked about Grammatical Framework, which has been a major influence on the development and design of the project; and Ariel Gutman and Maria Keet have presented a template language specification (now accompanied by a Scribunto implementation), and there was a recent update on Diff.

With these three solutions, are we done? Do we now know what the solution to natural language generation will look like?

I don’t know. It might be. These solutions have been developed by very smart folks, with cumulative decades of experience under their belts. The most important thing these implementations do right now is provide an existence proof: they demonstrate that a solution is possible. They show us that the goals of Abstract Wikipedia are not too lofty. Grammatical Framework did that for Abstract Wikipedia as a whole. I can genuinely say that without Grammatical Framework, the Abstract Wikipedia project as it is wouldn’t exist.

But are any of these solutions the approach that Abstract Wikipedia will ultimately take?

I don’t know. One major novelty is that Abstract Wikipedia would benefit from being able to scale to a large number of contributors with very diverse skill levels. Some might be experienced programmers, some might be trained linguists, others might bring native-level language skills. Which solution really scales well for a community of volunteer Wikimedians? This is very difficult to predict in advance. And this is why I don’t want us to commit to a specific solution yet.

I would like to see a Cambrian explosion of possible solutions. This is one of the reasons why Wikifunctions allows for all kinds of functions, why it is explicitly Turing complete: so we don’t lock ourselves prematurely into a single architecture, into a single solution. I am looking forward to a large number of different approaches being tried out, and then having the community building around these approaches discuss the advantages and disadvantages and also simply vote with their feet, through activity.

Yes, in the end we should make sure that we unify on a single solution. It would obviously be a tragic mistake if the natural language generation for Bengali worked entirely differently from the one for Hausa, using different abstract contents. But sometimes it might be necessary to develop some morphological or grammatical functions which are unique to a specific language, and then integrate them into the overall architecture for generating whole texts. Examples of that are the noun classes in Niger-Congo languages, or the morphology of Arabic and Hebrew, which interleaves vowels and consonants.

I see the community entirely taking the lead on which solution to choose, implement, and pursue. But I see that happening mostly implicitly, through the community’s actions, and not so much by the explicit means of debating and voting on a single solution. I don’t want us to prematurely decide on a single way, but rather to stay open and invite experimentation and new ideas. The space of possible ideas is so vast, and the benefit of choosing a solution better fitted to our community is so big, that it makes sense for us to be creative.

I expect the development to go through four different levels of evolution.

First level: we might start with simple lookup tables. We might have a function that returns a short description for a city, selecting the relevant word or phrase in the right language, just based on the language. There are 550 items in Wikidata that have the English short description “city”, more than 1,700 items with “scientific paper”, and more than 400 with the French short description “article scientifique”. Even with such a simple look-up table we could already improve the descriptions of thousands of items on Wikidata.
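A first-level lookup table could be sketched roughly as follows. This is only an illustration: the table contents, the concept keys, and the function name `short_description` are assumptions made for this example and do not correspond to any existing function on Wikifunctions.

```python
# Hypothetical lookup table: concept key -> language code -> short description.
# The keys, the entries, and the helper name are illustrative assumptions.
DESCRIPTIONS = {
    "city": {"en": "city", "fr": "ville", "de": "Stadt"},
    "scientific paper": {
        "en": "scientific paper",
        "fr": "article scientifique",
    },
}


def short_description(concept, language):
    """Return the short description for a concept in the given language.

    Returns None when the table has no entry, so a caller can tell that
    a translation is still missing instead of getting a wrong string.
    """
    return DESCRIPTIONS.get(concept, {}).get(language)


print(short_description("scientific paper", "fr"))
```

The selection depends on nothing but the language code, which is what keeps this level simple: adding a language means adding one string per concept.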

Second level: we can extend the possibilities widely by allowing for arguments in a templated structure, say in order to make a short description such as “city in Azerbaijan” or “city in Israel” (fifty occurrences each), or “French author” or “Argentinian chemist”. Such simple patterns will be useful for a large number of items and will also uncover a surprising number of edge cases. This will be useful for Wikidata descriptions, but can also already be useful for Wikipedia articles: many bots such as Rambot on English Wikipedia or LSJbot on Swedish have been working exactly like this.
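The second level could look roughly like the sketch below: one template per language with a slot for the country. The names `TEMPLATES`, `COUNTRY_NAMES`, and `city_description` are hypothetical and chosen only for this example.

```python
# Hypothetical per-language templates with one argument slot.
TEMPLATES = {
    "en": "city in {country}",
    "de": "Stadt in {country}",
}

# Hypothetical table of country names per language (in practice these
# labels would more plausibly come from Wikidata).
COUNTRY_NAMES = {
    "Azerbaijan": {"en": "Azerbaijan", "de": "Aserbaidschan"},
    "Israel": {"en": "Israel", "de": "Israel"},
}


def city_description(country, language):
    """Fill the language's template with the country's name in that language."""
    name = COUNTRY_NAMES[country][language]
    return TEMPLATES[language].format(country=name)


print(city_description("Azerbaijan", "en"))
print(city_description("Azerbaijan", "de"))
```

Edge cases show up quickly even here. In German, for instance, Switzerland takes an article (“Stadt in der Schweiz”), so a fixed pattern like the one above would produce an ungrammatical description for it.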

Third level: we add constructors to the templates. The constructors allow us to build whole articles from individual sentences. Instead of having model articles for a whole category, we now allow for manually written articles. This makes the task at the same time easier and harder: because the constructors now need to be reusable, they need to be more like modular sentences than whole articles.
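One way to picture constructors is as small data records, each with a sentence template per language, so that an article becomes a hand-picked sequence of them. The sketch below is an assumption-laden illustration: the constructor classes `BornIn` and `AwardReceived`, the template table, and the sample person and prize are all made up for this example.

```python
from dataclasses import asdict, dataclass


# Hypothetical constructors: each one carries the content of one sentence.
@dataclass
class BornIn:
    person: str
    place: str


@dataclass
class AwardReceived:
    person: str
    award: str
    date: str


# Hypothetical sentence templates, keyed by constructor name and language.
TEMPLATES = {
    ("BornIn", "en"): "{person} was born in {place}.",
    ("AwardReceived", "en"): "{person} received the {award} on {date}.",
}


def render(constructor, language):
    """Render a single constructor as one sentence."""
    template = TEMPLATES[(type(constructor).__name__, language)]
    return template.format(**asdict(constructor))


def render_article(constructors, language):
    """Render a manually assembled list of constructors as running text."""
    return " ".join(render(c, language) for c in constructors)


# Fictional sample content, not real data.
article = [
    BornIn(person="A. Author", place="Exampleville"),
    AwardReceived(person="A. Author", award="Example Prize", date="1 May 2020"),
]
print(render_article(article, "en"))
```

Because the same constructor can appear in many articles, each template has to work as a modular sentence, and every language needs its own template for every constructor.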

Fourth level: since the number of constructors will grow with the third level, we should aim to rein in the amount of work that needs to be done by the smaller language communities. This can be achieved by having abstract renderers for constructors: winning an award can then, instead of having a direct template in English such as (the example is simplified)

“{person} received the {award} on {date}”

have an abstract (i.e. language-independent) renderer such as (again simplified)

“Clause(subject=person, predicate=Q76664785, object=award, time=date)”

which in turn has language-dependent renderers, but fewer of those. This would often lead to less idiomatic results.
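The two-step idea can be sketched as follows, under several assumptions: the `Clause` record mirrors the simplified example above, while `award_received`, the small verb lexicon, and the English clause renderer are hypothetical names invented for this illustration. The predicate identifier is treated as an opaque key, assumed here to stand for receiving an award.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    """Language-independent clause, following the simplified example above."""
    subject: str
    predicate: str  # opaque identifier of the predicate
    object: str
    time: str


def award_received(person, award, date):
    """Abstract renderer: maps the constructor to a Clause, once for all languages."""
    return Clause(subject=person, predicate="Q76664785", object=award, time=date)


# Hypothetical lexicon: past-tense verb form per predicate and language.
PAST_TENSE = {("Q76664785", "en"): "received"}


def render_clause_en(clause):
    """Hypothetical English renderer shared by every constructor that yields a Clause."""
    verb = PAST_TENSE[(clause.predicate, "en")]
    return f"{clause.subject} {verb} the {clause.object} on {clause.time}."


# One clause renderer per language, instead of one template per constructor.
CLAUSE_RENDERERS = {"en": render_clause_en}


def render(clause, language):
    return CLAUSE_RENDERERS[language](clause)


# Fictional sample content, not real data.
print(render(award_received("A. Author", "Example Prize", "1 May 2020"), "en"))
```

In this sketch a language community writes one clause renderer plus lexicon entries rather than a template per constructor, which is where the saving comes from. The price is that every clause of this shape comes out in the same generic pattern.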

Some of the solutions so far fare really well with the second or third level, and others seem also to be capable of dealing with the fourth level. The third level lends itself well to a certain kind of user experience, which the fourth level does not. There will be advantages and disadvantages to balance.

The goal of this newsletter is not to prescribe such a development. Unlike with the development plan, we are not going to actively work on making these levels happen. We do not have to go through these particular levels, and we don’t have to be uniform about the levels in the different domains. In some areas, the first level might be entirely sufficient, in others we might flourish by using ideas described for the fourth level, and yet others might simply not fit into the described levels at all. And that’s OK.

The goal of this newsletter is to explain my thinking and decisions, show how the different systems and approaches we have previously mentioned fit together, and to allow for rational predictions of where we are going and what kind of contributions we are looking for. This is also an invitation to all of you: the NLG system will be developed by all of us together.

(Thanks to the volunteers and the team for the lively discussion on this essay, particularly Ariel Gutman, Maria Keet, David Martin, Mahir Morshed, and Aarne Ranta. The opinion reflected here is particularly Denny's.)

Volunteer corner

Next week, on Monday, November 7, 18:30 UTC, we are going to host our next volunteer corner. You can join us here: https://meet.google.com/evj-ktbq-hbn

Staff editing discussion is closing

Given that activity has calmed down, we are planning to close the staff editing discussion soon.

New developer channel

We will use the #wikipedia-abstract-tech channel on IRC (also bridged to Telegram) as a space more focused on developers and technology around Wikifunctions and Abstract Wikipedia. Our channels are documented here: Abstract Wikipedia#Participate

Development update for the week of October 28, 2022

Experience & Performance:

Avoidance of type expansion in orchestrator (T297742)
Removed Work Summary component
Aligned on what fields are mandatory and what fields are optional during ZFunction and ZObject creation
Finalized designs for Publish component
Fixed more FE bugs
Completed another round of testing

Meta-data:

Initial implementation of Readable summaries of all error types (T312611)

Natural Language Generation:

Shared document on UI and grammaticality judgments
Made progress on the template language
Demo of possible template creation GUI made (github)