The goal of Abstract Wikipedia is to generate natural language text from an abstract representation of content. To do so, we will use the lexicographic data in Wikidata. Although we are still quite far from being able to generate text, one area where we want to encourage everyone’s help is the coverage and completeness of that lexicographic data.
Today we want to present prototypes of two tools that could help people visualize, exemplify, and better understand the coverage of lexicographic data in Wikidata.
The first prototype is an annotation interface that allows users to annotate sentences in any language, associating each word or expression with a Lexeme from Wikidata, including picking its Form and Sense.
You can see an example in the following screenshot.
Each “word” of the sentence here is annotated with a Lexeme (the Lexeme ID L31818 is given just under the word), followed by the lemma, the language, and the part of speech. Then comes, if selected, the specific Form that is being used in context: for example, on “dignity” we see the Form ID L31818#F1, which is the singular Form of the Lexeme. Lastly comes the Sense, which is assigned Sense ID L31818#S1 and defined by a gloss.
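To make the structure of these identifiers concrete, here is a minimal sketch in Python of how an annotation's IDs could be parsed. It is not the prototype's code: the names `LexemeRef` and `parse_lexeme_ref` are hypothetical, and the sketch assumes IDs are written either with `#` as in this post or with `-` as Wikidata writes Form and Sense IDs (for example `L31818-F1`).

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches a bare Lexeme ID ("L31818") or a Form/Sense ID with either
# separator ("L31818#F1", "L31818-S1"). Accepting both is an assumption.
_ID_PATTERN = re.compile(r"^L(\d+)(?:[#-]([FS])(\d+))?$")


@dataclass(frozen=True)
class LexemeRef:
    """Hypothetical container for one parsed identifier."""
    lexeme: str           # e.g. "L31818"
    kind: Optional[str]   # "form", "sense", or None for a bare Lexeme
    index: Optional[int]  # the number after F or S, if any


def parse_lexeme_ref(text: str) -> LexemeRef:
    """Split an identifier into its Lexeme part and optional Form/Sense part."""
    match = _ID_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a Lexeme, Form, or Sense ID: {text!r}")
    number, letter, index = match.groups()
    kind = {"F": "form", "S": "sense"}.get(letter)
    return LexemeRef(f"L{number}", kind, int(index) if index else None)


if __name__ == "__main__":
    for raw in ("L31818", "L31818#F1", "L31818#S1"):
        print(parse_lexeme_ref(raw))
```

Keeping the Lexeme part separate from the Form or Sense part mirrors the annotation itself, where a word can be linked to a Lexeme before a Form or Sense has been picked.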
At any time, you can remove any of the annotations, or add new annotations. Some of the options will take you directly to Wikidata. For example, if you want to add a Sense to a given Lexeme, because it has no Senses or is missing the one you need, it will take you to Wikidata and let you do that there in the normal fashion. Once added there, you can come back and select the newly added Sense.
The user interface of the prototype is a bit slow, so please give it a few seconds when you initiate an action. It should work out of the box in different languages. The Universal Language Selector is available (at the top of the page), which you can use to change the language. Note that glosses of Senses are frequently only available in the language of the Lexeme, and the UI doesn’t yet do language fallback, so if you look at English sentences with a German UI you might often find missing glosses.
The goal of this prototype is to make the Wikidata community's progress on the coverage of the lexicographic data more tangible. You can take a sentence in any written language, put it into this system, and find out how completely you can annotate it. It's a way to showcase the lexicographic data in Wikidata and to build anecdotal experience with it.
Corpus coverage dashboard
The second prototype tool is a dashboard that shows the coverage of the data compared to the Wikipedia corpus in each of forty languages.
Last year, whilst in my previous position at Google Research, I co-authored a publication where we built and published language models out of the cleaned-up text of about forty Wikipedia language editions. Besides the language models, we also published the raw data: this text has been cleaned up by the pre-processing system that Google uses on Wikipedia text in order to integrate the text in several of its features. So while this dataset consists of relatively clean natural language text (certainly compared to the raw wikitext), it still contains plenty of artefacts. If you know of better large-scale encyclopaedic text corpora we can use, maybe better cleaned-up versions of Wikipedia, or ones covering more languages, please let us know.
We extracted these texts from the TensorFlow models and provide the extracted texts for download. If this proves useful, we could move the cleaned-up text to a more permanent home. We split the text into tokens, counted the occurrences of each word, and compared how many of these tokens appear among the Forms of Lexemes of the given language in Wikidata’s lexicographic data.
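The splitting and counting step could look roughly like the following Python sketch. The tokenizer actually used for the dashboard is not described in this post, so the regular expression, the lowercasing, and the function name `count_tokens` are all assumptions. A letter-based pattern like this one would also not work for languages written without spaces between words, such as Chinese, Japanese, or Thai.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of Unicode letters, optionally joined by an
# apostrophe or hyphen ("it's", "well-known"). Digits and punctuation
# are dropped.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def count_tokens(text: str) -> Counter:
    """Count how often each lowercased word form occurs in the text."""
    return Counter(m.group(0).lower() for m in TOKEN_PATTERN.finditer(text))


if __name__ == "__main__":
    counts = count_tokens("Time flies. It's time; time flies!")
    for form, occurrences in counts.most_common():
        print(form, occurrences)
```

Lowercasing merges "Time" and "time" into one word form, which is convenient for English but would be a poor choice for languages where capitalisation distinguishes Forms, such as German nouns.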
A screenshot of the current state for English is given here.
We see how many Forms for this language are available in Wikidata, and we see how many different Forms are attested in Wikipedia (i.e., how many different words, or word types, are in the Wikipedia of the given language). The number of tokens is the total number of words in the given language corpus. Covered forms says how many of the forms in the corpus are also in Wikidata's Lexeme set, and covered tokens tells us how many of the occurrences that covers (so, if the word “time” appears 100 times in English Wikipedia, it would be counted as one covered form, but 100 covered tokens). The two pie charts visualize the coverage of forms and tokens respectively.
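The difference between covered forms and covered tokens can be illustrated with a short Python sketch. It assumes the corpus has already been reduced to a mapping from word form to number of occurrences, and that the Forms from Wikidata are available as a plain set of strings. The function name `coverage` and the returned keys are hypothetical, not the dashboard's own.

```python
from collections import Counter
from typing import Dict, Set


def coverage(corpus_counts: Counter, wikidata_forms: Set[str]) -> Dict[str, float]:
    """Compute form-level and token-level coverage of a corpus."""
    covered = {form: n for form, n in corpus_counts.items() if form in wikidata_forms}
    forms_in_corpus = len(corpus_counts)           # distinct word types
    tokens_in_corpus = sum(corpus_counts.values())  # total occurrences
    covered_forms = len(covered)
    covered_tokens = sum(covered.values())
    return {
        "forms_in_corpus": forms_in_corpus,
        "tokens_in_corpus": tokens_in_corpus,
        "covered_forms": covered_forms,
        "covered_tokens": covered_tokens,
        # Guard against an empty corpus to avoid dividing by zero.
        "form_coverage": covered_forms / forms_in_corpus if forms_in_corpus else 0.0,
        "token_coverage": covered_tokens / tokens_in_corpus if tokens_in_corpus else 0.0,
    }


if __name__ == "__main__":
    corpus = Counter({"time": 100, "the": 50, "blorp": 1})
    stats = coverage(corpus, {"time", "the", "times"})
    print(stats)
```

In this toy corpus, "time" contributes one covered form but 100 covered tokens, so token coverage (150 of 151) is much higher than form coverage (2 of 3). Frequent words tend to be covered first, which is why the two pie charts can look quite different.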
Finally, there is a link to the thousand most frequent forms that are not yet in Wikidata. This can help communities prioritise and ramp up coverage quickly. Note, though, that the progress report is produced manually and does not update automatically. For now, I plan to run an update from time to time.
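A list like this can be derived from the same counts. The following Python sketch assumes the corpus counts and the set of Wikidata Forms are already in memory; the function name `most_frequent_missing` and its `limit` parameter are hypothetical and only illustrate the idea.

```python
from collections import Counter
from typing import List, Set, Tuple


def most_frequent_missing(
    corpus_counts: Counter, wikidata_forms: Set[str], limit: int = 1000
) -> List[Tuple[str, int]]:
    """Return the most frequent corpus forms that have no matching Form in Wikidata."""
    missing = Counter(
        {form: n for form, n in corpus_counts.items() if form not in wikidata_forms}
    )
    # most_common sorts by descending count and truncates to `limit` entries.
    return missing.most_common(limit)


if __name__ == "__main__":
    corpus = Counter({"time": 100, "the": 50, "blorp": 7, "zyx": 3})
    for form, occurrences in most_frequent_missing(corpus, {"time", "the"}, limit=10):
        print(form, occurrences)
```

Sorting by frequency means that each Lexeme added from the top of the list raises token coverage by as much as possible.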
Both prototype tools are exactly that: prototypes, not real products. We have not committed to supporting and developing these prototypes further. At the same time, all of the code and data is of course open sourced. If anyone would like to pick up the development or maintenance of these prototypes, you would be more than welcome — please let us know (on my talk page, or via e-mail, or on the Tool ideas page).
Also, if someone likes the idea but thinks that a different implementation would be better, please move ahead with that — I am happy to support and talk with you. There is much to improve here, but we hope that these two prototypes will lead to more development of content and tools in the space of lexicographic data.
- Mandy Guo, Zihang Dai, Denny Vrandečić, Rami Al-Rfou: Wiki-40B: Multilingual Language Model Dataset, LREC 2020.