Abstract Wikipedia updates

The goal of Abstract Wikipedia is to generate natural language texts from an abstract representation of content. To do so, we will use lexicographic data from Wikidata. Although we are still far from being able to generate texts, one thing we want to do now is encourage everyone to help with the coverage and completeness of the lexicographic data in Wikidata.

Below we present prototypes of two tools that can help people visualize, exemplify, and better guide our understanding of the coverage of lexicographic data in Wikidata.

Annotation interface

The first prototype is an annotation interface that lets users annotate sentences in any language, associating each word or expression with a Lexeme from Wikidata, including choosing its Form and Sense.

You can see an example in the following screenshot.

Each “word” of this sentence is annotated with a Lexeme (the Lexeme ID, such as L31818, is given just below the word), followed by the lemma, the language, and the part of speech. Next, if selected, comes the specific Form being used in context: for example, on “dignity” we see the Form ID L31818#F1, which is the singular Form of the Lexeme. Last comes the Sense, which here has the Sense ID L31818#S1 and is defined by a gloss.
At any time, you can remove any of the annotations, or add new annotations. Some of the options will take you directly to Wikidata. For example, if you want to add a Sense to a given Lexeme, because it has no Senses or is missing the one you need, it will take you to Wikidata and let you do that there in the normal fashion. Once added there, you can come back and select the newly added Sense.
The user interface of the prototype is a bit slow, so please give it a few seconds when you initiate an action. It should work out of the box in different languages. The Universal Language Selector is available (at the top of the page), which you can use to change the language. Note that glosses of Senses are frequently only available in the language of the Lexeme, and the UI doesn’t yet do language fallback, so if you look at English sentences with a German UI you might often find missing glosses.
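
The three identifier shapes quoted above (L31818, L31818#F1, L31818#S1) can be told apart mechanically. The following is a minimal sketch, not code from the prototype: the names `AnnotationId` and `parse_annotation_id` are hypothetical, and the pattern is inferred only from the example IDs in this text.

```python
import re
from typing import NamedTuple, Optional

# Assumption: IDs look like the examples in the text, i.e. "L<digits>",
# optionally followed by "#F<digits>" (a Form) or "#S<digits>" (a Sense).
_ID_PATTERN = re.compile(r"(L\d+)(?:#([FS])(\d+))?")


class AnnotationId(NamedTuple):
    lexeme: str           # e.g. "L31818"
    kind: str             # "lexeme", "form", or "sense"
    index: Optional[int]  # number after F or S; None for a bare Lexeme ID


def parse_annotation_id(text: str) -> AnnotationId:
    """Split an ID string into its Lexeme part and its Form/Sense part."""
    match = _ID_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a Lexeme, Form, or Sense ID: {text!r}")
    lexeme, marker, number = match.groups()
    if marker is None:
        return AnnotationId(lexeme, "lexeme", None)
    kind = "form" if marker == "F" else "sense"
    return AnnotationId(lexeme, kind, int(number))
```

For example, `parse_annotation_id("L31818#F1")` yields the Lexeme `L31818` with kind `"form"` and index `1`, while a bare `"L31818"` is reported as kind `"lexeme"`.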

Technologically, this is a prototype implemented entirely in JavaScript and CSS on top of a vanilla MediaWiki installation. This is likely not the best possible technical solution for such a system, but it should help to determine whether there is enough user interest in the tool to justify a potential reimplementation. Also, it would be a fascinating task to agree on an API which can be implemented by other groups to provide the selection of Lexemes, Senses, and Forms for input sentences. The current baseline here is extremely simple, and would not be good enough for an automated tagging system. Having such annotations available for many sentences in many languages could provide a great corpus for training natural language understanding systems. There is a lot that could be built upon that.

The goal of this prototype is to make more tangible the Wikidata community's progress regarding the coverage of the lexicographical data. You can take a sentence in any written language, put it into this system, and find out how complete you can get with your annotations. It's a way to showcase and create anecdotal experience of the lexicographic data in Wikidata.

The prototype annotation interface is at: annotation.wmcloud.org.
You can discuss it here: annotation.wmcloud.org/wiki/Discussion (you will need to create a new account on that wiki).

Corpus coverage dashboard

The second prototype tool is a dashboard that shows the coverage of the lexicographic data compared to the Wikipedia corpus in each of forty languages.

Last year, while in my previous position at Google Research, I co-authored a publication in which we built and published language models from the cleaned-up text of about forty Wikipedia language editions.^[1] Besides the language models, we also published the raw data: text that has been cleaned up by the pre-processing system Google uses on Wikipedia text in order to integrate it into several of its features. So while this dataset consists of relatively clean natural language text, certainly compared to the raw wiki text, it still contains plenty of artefacts. If you know of better large-scale encyclopaedic text corpora we could use, such as better cleaned-up versions of Wikipedia or ones covering more languages, please let us know.

We extracted these texts from the TensorFlow models and provide the extracted texts for download. We split the text into tokens, counted the occurrences of each word, and compared how many of these tokens appear among the Forms of Lexemes of the given language in Wikidata’s lexicographic data. If this proves useful, we could move the cleaned-up text to a more permanent home.
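
The splitting and counting step can be sketched roughly as follows. This is only an illustration under stated assumptions: the text above does not say how tokens were actually delimited, so the `\w+` rule, the optional lowercasing, and the function name `count_tokens` are all hypothetical choices.

```python
import re
from collections import Counter

# Assumption: a token is a maximal run of word characters (\w+).
# The real pipeline may well have used a different tokenizer.
_TOKEN = re.compile(r"\w+")


def count_tokens(text: str, lowercase: bool = True) -> Counter:
    """Count how often each distinct word form occurs in the text."""
    tokens = _TOKEN.findall(text)
    if lowercase:
        # Assumption: case is folded so "Time" and "time" count together.
        # For languages where case is meaningful, pass lowercase=False.
        tokens = [token.lower() for token in tokens]
    return Counter(tokens)


counts = count_tokens("Time after time, the time came.")
# counts["time"] is 3; there are 4 distinct forms and 6 tokens in total.
```

A rule this simple will not work for languages written without spaces between words, so a real run over forty languages would need language-aware tokenization.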

A screenshot of the current state for English is given here.

Screenshot of Wikidata lexicographic coverage dashboard.

We see how many Forms for this language are available in Wikidata, and we see how many different Forms are attested in Wikipedia (i.e., how many different words, or word types, are in the Wikipedia of the given language). The number of tokens is the total number of words in the given language corpus. Covered forms says how many of the forms in the corpus are also in Wikidata's Lexeme set, and covered tokens tells us how many of the occurrences that covers (so, if the word “time” appears 100 times in English Wikipedia, it would be counted as one covered form, but 100 covered tokens). The two pie charts visualize the coverage of forms and tokens respectively.
Finally, there is a link to the thousand most frequent forms that are not yet in Wikidata. This can help communities prioritise their work and ramp up coverage quickly. Note, though, that the progress report is produced manually and does not update automatically. I plan to run an update from time to time for now.
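
The dashboard figures described above can be reproduced from two inputs: a count of every form in the corpus and the set of Forms known to Wikidata for that language. The sketch below is an assumption about how one might compute them, not the dashboard's actual code; `coverage_report` and `most_frequent_missing` are hypothetical names, and the sample data is made up apart from the “time” example taken from the text.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def coverage_report(corpus_counts: Counter, wikidata_forms: Set[str]) -> Dict[str, float]:
    """Compute covered forms and covered tokens for one language."""
    total_forms = len(corpus_counts)
    total_tokens = sum(corpus_counts.values())
    covered = {f: n for f, n in corpus_counts.items() if f in wikidata_forms}
    covered_forms = len(covered)
    covered_tokens = sum(covered.values())
    return {
        "forms_in_corpus": total_forms,
        "tokens_in_corpus": total_tokens,
        "covered_forms": covered_forms,
        "covered_tokens": covered_tokens,
        # Ratios behind the two pie charts; 0.0 for an empty corpus.
        "form_coverage": covered_forms / total_forms if total_forms else 0.0,
        "token_coverage": covered_tokens / total_tokens if total_tokens else 0.0,
    }


def most_frequent_missing(
    corpus_counts: Counter, wikidata_forms: Set[str], limit: int = 1000
) -> List[Tuple[str, int]]:
    """List the most frequent corpus forms that are not yet in Wikidata."""
    missing = Counter({f: n for f, n in corpus_counts.items() if f not in wikidata_forms})
    return missing.most_common(limit)


# Made-up sample: "time" occurs 100 times, as in the example in the text.
sample_counts = Counter({"time": 100, "the": 300, "zzyzx": 2})
sample_forms = {"time", "the", "dignity"}
report = coverage_report(sample_counts, sample_forms)
# "time" contributes 1 to covered_forms and 100 to covered_tokens.
```

In this sample, two of the three forms are covered but they account for 400 of the 402 tokens, which illustrates why token coverage is usually much higher than form coverage.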

The prototype corpus coverage dashboard is at: Wikidata:Lexicographical coverage
You can discuss it here: Wikidata talk:Lexicographical coverage

Help wanted

Both prototype tools are exactly that: prototypes, not real products. We have not committed to supporting and developing these prototypes further. At the same time, all of the code and data is of course open sourced. If anyone would like to pick up the development or maintenance of these prototypes, you would be more than welcome — please let us know (on my talk page, or via e-mail, or on the Tool ideas page).

Also, if someone likes the idea but thinks that a different implementation would be better, please move ahead with that — I am happy to support and talk with you. There is much to improve here, but we hope that these two prototypes will lead to more development of content and tools in the space of lexicographic data.

Notes

[1] Mandy Guo, Zihang Dai, Denny Vrandečić, Rami Al-Rfou: Wiki-40B: Multilingual Language Model Dataset, LREC 2020.