Note: As of January 2023, my affiliation with the Foundation has ended, and I will no longer use this account. I may continue contributing to Wikidata and the Abstract Wikipedia project through my personal account.
As part of my work as a Software Engineer & Linguist at Google, during May-December 2022, I had the opportunity to work as a contractor for Wikimedia through a Google.Org fellowship for Abstract Wikipedia, contributing especially to its Natural Language Generation (NLG) workstream. You can read more about this fellowship in a Diff newsletter announcement.
Prior to my work at Google, I completed a Ph.D. in Linguistics at the University of Konstanz, researching Neo-Aramaic dialects. I published an open-access book based on my thesis.
Disclaimer: I worked for or provided services to the Wikimedia Foundation, and this is the account I used for edits or statements I made in that role. However, the Foundation does not vet all my activity, so edits, statements, or other contributions made by this account may not reflect the views of the Foundation.
The full-time fellowship engagement ended after six months, at the end of October 2022 (see newsletter update), but I was able to continue contributing on a part-time basis until the end of 2022. I've summarized the course of the fellowship and my contributions in a goodbye letter, which was published in the last Abstract Wikipedia newsletter of the year. For convenience, I reproduce it here:
Over the last six months, I've been part of the Abstract Wikipedia team as a Google.Org fellow. At the Foundation, my aim was to leverage my expertise in Natural Language Generation, which I honed while working on NLG at Google for over six years, to advance the Abstract Wikipedia project.
The first half of the fellowship was mostly dedicated to writing design docs: the architecture of an NLG system and a template language specification (the latter co-authored with Maria Keet, to whom I’m grateful). At the same time, I was involved in other discussions, be it the quality of lexical data on Wikidata or the form Abstract Content should take (many thanks to Kutz Arrieta for leading the latter discussion).
At the midpoint of the fellowship, I felt the urge to create something more concrete. Unfortunately, the Wikifunctions platform was not ready to serve as a solid development platform, so, per the advice of the Google.Org Tech Lead Ori Livneh, I set out to create a prototype NLG system on Wikipedia’s Scribunto platform, a Lua-based scripting environment embedded within Wikipedia.
To my great pleasure, the Scribunto platform, with its Wikidata API, allowed me to rapidly create a functional NLG system capable of transforming Abstract Content into text (see recorded demo or example output). The system is not yet exhaustive; however, it contains the necessary components outlined in the proposed architecture:
- An Abstract Content repository, allowing the specification of an article outline for individual Wikidata items.
- A Constructors repository, containing logic for auto-creation of abstract content for Wikidata items, depending on their types (people, places etc.).
- Templatic renderers, which are templates specifying how each constructor should be verbalized in the different realization languages.
- Template functions written in Lua or in the template language, to be used within template slots. These in particular allow importing of Wikidata lexemes and their representation in an internal format, using dedicated helper modules.
- Morphosyntactic dependency relations written in Lua using a limited set of unification operators, which allow specifying the flow of grammatical features between template elements.
- Phonotactic functions written in Lua, which allow specification of language-specific phonotactic rules (such as the a/an alternation in English).
- Text assembler taking care of constructing the rendered text, while adjusting punctuation, spacing and capitalization.
On top of these there are modules with the necessary logic needed to parse and evaluate templates, represent lexemes and unifiable features and interact with Wikidata. The main module controls the overall flow of the NLG pipeline.
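To make the division of labour between these components more tangible, here is a minimal sketch of three of them: feature unification for agreement, a phonotactic rule for the English a/an alternation, and a text assembler. It is written in Python purely for illustration and is not the prototype's Lua code. All names (`Slot`, `unify`, `select_form`, `apply_article_rule`, `assemble`, `render`) and the sample data are hypothetical, and the vowel test is a crude spelling-based approximation of what is really a pronunciation-based rule.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """One template element: a surface form plus its grammatical features."""
    form: str
    features: dict = field(default_factory=dict)


def unify(a, b):
    """Merge two feature dicts; return None if any shared feature conflicts."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


# Hypothetical mini-lexeme: past-tense forms of the English copula.
COPULA_PAST = [
    Slot("was", {"number": "singular"}),
    Slot("were", {"number": "plural"}),
]


def select_form(forms, required):
    """Pick the first form whose features unify with the required features."""
    for candidate in forms:
        if unify(candidate.features, required) is not None:
            return candidate.form
    raise ValueError("no form is compatible with the required features")


def apply_article_rule(words):
    """English a/an alternation, approximated by the next word's first letter."""
    out = list(words)
    for i in range(len(out) - 1):
        if out[i].lower() in ("a", "an"):
            out[i] = "an" if out[i + 1][0].lower() in "aeiou" else "a"
    return out


def assemble(words):
    """Join words, tidy spacing before punctuation, capitalize, end with a period."""
    text = " ".join(w for w in words if w)
    text = text.replace(" ,", ",").replace(" .", ".")
    if not text.endswith("."):
        text += "."
    return text[0].upper() + text[1:]


def render(subject, predicate_words):
    """Toy renderer for '<subject> <copula> a <predicate>' (singular predicates only)."""
    copula = select_form(COPULA_PAST, subject.features)  # agreement via unification
    words = [subject.form, copula, "a", *predicate_words]
    return assemble(apply_article_rule(words))


print(render(Slot("Ada Lovelace", {"number": "singular"}), ["English", "mathematician"]))
# prints "Ada Lovelace was an English mathematician."
```

The point of the sketch is the separation of concerns: the renderer only names the slots, agreement is resolved by unification, and surface adjustments happen in later, language-specific passes.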
My primary aim in developing this prototype was to substantiate the designs I've proposed and to provide example code for a similar implementation on Wikifunctions. In fact, if Wikifunctions supports Lua, the code can probably be reused as-is. The modules in the above bulleted list would become user-editable functions, while those mentioned thereafter could be integrated in the backend system of Wikifunctions, as they are expected to be relatively stable.
Yet, there is a second, more subtle aim. During my fellowship, I have grown skeptical of the premise that Wikifunctions is necessary to achieve the vision of Abstract Wikipedia. While user contributions (e.g., functions, renderers, or constructors) are necessary for its success, these should be NLG-oriented, and they do not need a general functional platform such as Wikifunctions. By focusing on building an NLG-oriented system, the vision of Abstract Wikipedia can be attained more rapidly. (Being part of a fellowship, it maybe shouldn’t come as a surprise that I'm on the "One Ring" side…). Together with my colleagues Ori Livneh, Ali Assaf and Mary Yang, I've put my viewpoint in detailed writing. I believe that the template-language proposal, implemented in this prototype, is a good foundation to build upon.
The Scribunto prototype shows that a platform more limited than Wikifunctions can already be used to generate articles from Abstract Content on real Wikipedias. It suffices to copy over the necessary modules to the target wiki and define the language-specific renderers, functions and relations. Whether you agree with me or not, I invite you to play around with the system and edit the relevant modules to add functionality for your favorite language.
As my fellowship is ending, I would like to thank all my colleagues in Abstract Wikipedia's Natural Language Generation workstream, for the passionate discussions and ideas. In particular I am thankful to Cory Massaro, the Tech Lead of the workstream, for his guidance and confidence, and to Eunice Moon, my Google.Org colleague and Product Manager of the workstream, for her superb organizational skills.
- Proposal of an NLG Architecture for Abstract Wikipedia (see also newsletter announcement).
- Template Language for Wikifunctions, co-authored with Maria Keet (see also newsletter announcement).
- A prototype implementation in Scribunto of a templatic NLG realizer (see example realizations and overview documentation).
- I contributed to a Diff newsletter update about the NLG workstream's work (also available as an Abstract Wikipedia newsletter).
- I contributed to the discussion of the Abstract Content Representation.
- An evaluation of the Abstract Wikipedia project & architecture, co-authored with three other Google.Org fellows: Ori Livneh, Ali Assaf & Mary Yang (reported in the Signpost).
- Guidelines for Hebrew lexicographical material
- Using Lexemes in Abstract Wikipedia - a Wikidata Quality Days 2022 Presentation (see also notes).
- Video about the template language for Wikimania 2022 (my contribution starts at minute 25:00).
- Overview of the proposed NLG architecture for Wikifunctions and exploration of UI ideas (together with Sandy Woodruff).
- A demo of the above-mentioned Scribunto prototype.
- During the fellowship offsite, I helped to organize Maria Keet's talk titled "Knowledge-to-text Natural Language Generation for Agglutinating African Languages".
You can see a selection of my publications on one of these sites:
The full list of my publications can be found in my CV.
Offsite in Zurich
At the end of August 2022, we had an offsite in beautiful Zurich...