Research:NLP Tools for Wikimedia Content

Tracked in Phabricator: Task T316941
Created: 20:07, 13 October 2022 (UTC)
Duration: 2022-09 – 2023-06

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Many research projects use Wikimedia textual content as data, such as training language models on Wikipedia articles[1], studying how an individual's gender identity affects how they are written about[2], or finding sentences in need of citations[3].

One of the main challenges in working with textual content from Wikimedia projects comes from its multilingual nature[4]: Wikipedia alone exists in over 300 language editions. Common open-source tools for natural language processing (NLP), such as NLTK or SpaCy, do not explicitly support most of these languages, so they perform poorly even on seemingly simple tasks for languages in which commonly-used heuristics do not apply (for example, relying on whitespace to detect token boundaries). As a result, there are no good, shared standards or open-source libraries for processing Wikimedia textual content across the more than 300 languages.

Therefore, the vision of this project is:

  • Researchers could start with a Wikipedia article (wikitext or HTML), strip the syntax to leave just paragraphs of plaintext, and then further tokenize those paragraphs into sentences and words for input into models (see the sketch after this list).
  • This would be language-agnostic – i.e. the library would work equally well regardless of Wikipedia language.
  • Each component would be a Python library that is easily configurable but provides good default performance out-of-the-box.
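
As a rough, hedged illustration of that pipeline, the sketch below uses the existing mwparserfromhell library for the wikitext-to-plaintext step and stands in placeholder functions for the sentence and word tokenization steps; the placeholder names and the naive splitting logic are illustrative assumptions, not the project's actual implementation.

    # Sketch of the envisioned pipeline: wikitext -> plaintext -> sentences -> words.
    # mwparserfromhell handles the wikitext-stripping step; the two tokenize_* helpers
    # are naive placeholders for the language-agnostic tokenizers this project targets.
    import mwparserfromhell

    def wikitext_to_plaintext(wikitext: str) -> str:
        """Strip templates, links, and markup, leaving plain paragraphs."""
        return mwparserfromhell.parse(wikitext).strip_code()

    def tokenize_sentences(text: str) -> list[str]:
        """Placeholder: naive split on '. '; a real tokenizer must be language-aware."""
        return [s.strip() for s in text.split(". ") if s.strip()]

    def tokenize_words(sentence: str) -> list[str]:
        """Placeholder: naive whitespace split; fails for e.g. Japanese or Thai."""
        return sentence.split()

    wikitext = "'''Example''' is an [[article]]. It has two sentences."
    for sentence in tokenize_sentences(wikitext_to_plaintext(wikitext)):
        print(tokenize_words(sentence))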

Current state

There are good existing libraries for converting wikitext to plaintext (edit types library) and converting HTML to plaintext (HTML dumps library). While both can see continued improvement and streamlining, the basic functionality exists. We have explored various approaches for sentence and word tokenization but our code is scattered across projects, still has known gaps, and has not been well-tested in many languages.

Potential Applications

  • Structured Tasks
    • Copy-edit: need to split article into individual sentences to feed into model.
    • Add-a-link: split articles into sentences to capture the appropriate context for each word in the model. Split sentences into words to know which tokens to evaluate for links.
    • Citation-needed: need to split article into individual sentences to feed into citation-needed model.
  • Metrics / Analysis
    • Edit types: summarize what was changed by an edit on Wikipedia – e.g., # of sentences/words added.
    • Readability: extract sentences to identify the number of entities per sentence as a proxy for readability.
    • Quality model: number of sentences as a better proxy for amount of content than bytes.
  • Extraction
    • TextExtracts Extension: return first k sentences in an article.
    • HTML Dumps: extract plaintext from article HTML – eventually might be nice to feed output into sentence tokenizer for even more control.
    • Vandalism detection: feature generation would benefit from word tokenization and likely sentence tokenization as well.

Components

The library (mwtokenizer) contains functionality for two core tasks: sentence and word tokenization.
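
For orientation, a minimal usage sketch might look like the following; the import path, class name, and method names here are assumptions based on the description above rather than confirmed API, so check the library's documentation for the actual interface.

    # Hypothetical usage sketch of mwtokenizer; the names below are assumptions,
    # not confirmed API. Consult the library's documentation for the real interface.
    from mwtokenizer.tokenizer import Tokenizer  # assumed import path

    tokenizer = Tokenizer(language_code="en")    # assumed constructor argument

    text = "Mr. Smith visited Paris. He stayed for 3.5 days."
    sentences = list(tokenizer.sentence_tokenize(text))   # assumed method name
    words = list(tokenizer.word_tokenize(sentences[0]))   # assumed method name
    print(sentences)
    print(words)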

Sentence Tokenization

Our sentence tokenization is based on a series of heuristics. The core of the functionality depends on building a good language-inclusive list of sentence-ending punctuation. There are many edge cases, however, where sentence-ending punctuation does not indicate a sentence boundary, and these must be filtered out. They include more straightforward instances to detect, such as periods used as decimal points, but also the much harder detection of abbreviations. For this latter case, we opt to build a list of abbreviations based on words on Wiktionary that end with sentence-ending punctuation. We then filter that list based on how often a given candidate abbreviation appears in Wikipedia with and without the punctuation, retaining only words that appear at least 10 times and carry the associated punctuation at least 60% of the time (these are arbitrary thresholds, but we find that they work well in practice).
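
A minimal sketch of that filtering step, assuming illustrative toy counts and the thresholds described above (the data structures and variable names are not the project's exact implementation):

    # Illustrative filtering of abbreviation candidates harvested from Wiktionary.
    # with_punct / without_punct stand in for corpus counts over a Wikipedia dump;
    # the 10-occurrence and 60% thresholds mirror the heuristics described above.
    from collections import Counter

    MIN_OCCURRENCES = 10
    MIN_PUNCT_RATIO = 0.6

    def filter_abbreviations(candidates, with_punct, without_punct):
        abbreviations = set()
        for word in candidates:
            total = with_punct[word] + without_punct[word]
            if total < MIN_OCCURRENCES:
                continue  # too rare to judge reliably
            if with_punct[word] / total >= MIN_PUNCT_RATIO:
                abbreviations.add(word)
        return abbreviations

    # Toy counts: "etc" almost always appears as "etc." and is kept;
    # "cat" rarely ends with a period and is dropped.
    with_punct = Counter({"etc": 95, "cat": 2})
    without_punct = Counter({"etc": 5, "cat": 98})
    print(filter_abbreviations({"etc", "cat"}, with_punct, without_punct))  # {'etc'}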

To evaluate our approach, we use FLORES-200, a dataset of single sentences translated into many languages, and test how well our tokenizer can properly split sentences that have been joined together from this dataset.
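
A simplified version of that evaluation, with a naive regular-expression tokenizer standing in for ours and toy sentences standing in for FLORES-200 data:

    # Illustrative evaluation: join known single sentences (e.g. from FLORES-200),
    # re-split them with the tokenizer under test, and measure how many of the
    # original sentences are recovered exactly.
    import re

    def evaluate_sentence_tokenizer(sentences, sentence_tokenize):
        joined = " ".join(sentences)
        predicted = [s.strip() for s in sentence_tokenize(joined)]
        recovered = sum(1 for sent in sentences if sent.strip() in predicted)
        return recovered / len(sentences)

    # Toy stand-in tokenizer: split after sentence-ending punctuation followed by whitespace.
    def naive_tokenize(text):
        return re.split(r"(?<=[.!?])\s+", text)

    print(evaluate_sentence_tokenizer(
        ["The city has 1.2 million residents.", "It was founded in 1843."], naive_tokenize))  # 1.0
    print(evaluate_sentence_tokenizer(
        ["Mr. Smith arrived in 1843.", "He stayed two years."], naive_tokenize))  # 0.5: 'Mr.' triggers a false split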

NOTE: this strategy does not work for languages like Thai that do not use sentence-ending punctuation. A language model would be required for these languages.

Word Tokenization

Our word tokenization strategy varies based on the language. For whitespace-delimited languages like English, we split on whitespace and then do some additional cleaning around punctuation: e.g., removing trailing commas, but not splitting on punctuation within a word or stripping punctuation that is part of an abbreviation. For non-whitespace-delimited languages like Japanese, we train SentencePiece models to at least identify more reasonable sub-words to split on.
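
As a hedged sketch of those two paths (the abbreviation set, cleanup rules, and SentencePiece model path below are simplified placeholders):

    # Illustrative word tokenization: whitespace splitting plus light punctuation
    # cleanup for languages like English, and a pre-trained SentencePiece model for
    # non-whitespace-delimited languages like Japanese. The abbreviation set and
    # model file below are placeholders.
    import sentencepiece as spm

    ABBREVIATIONS = {"etc.", "e.g.", "Mr."}  # would come from the Wiktionary-derived list

    def tokenize_whitespace_language(sentence):
        tokens = []
        for raw in sentence.split():
            # Keep trailing punctuation only when the token is a known abbreviation.
            token = raw if raw in ABBREVIATIONS else raw.strip(",.;:!?")
            if token:
                tokens.append(token)
        return tokens

    def tokenize_with_sentencepiece(sentence, model_file="ja.model"):
        sp = spm.SentencePieceProcessor(model_file=model_file)  # model trained offline
        return sp.encode(sentence, out_type=str)

    print(tokenize_whitespace_language("Mr. Smith bought apples, oranges, etc."))
    # ['Mr.', 'Smith', 'bought', 'apples', 'oranges', 'etc.']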

To evaluate our approach, we treat links in Wikipedia as reasonable indicators of word boundaries in that their anchor text may cover multiple words but a link is unlikely to start or terminate mid-word. We measure the performance of our tokenizer in two ways: how many tokens it takes to cover the anchor text of a link (fewer = better) and how often it generates tokens that cross link boundaries (fewer = better).
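
The two measurements could be sketched roughly as below; the character-offset link annotations and the naive tokenizer are placeholders, since real anchors would come from parsed wikitext or HTML.

    # Illustrative link-based evaluation. Each link is a (start, end) character span
    # of its anchor text in the plaintext; a good tokenizer covers each anchor with
    # few tokens and rarely produces tokens that straddle a link boundary.
    def evaluate_word_tokenizer(text, links, tokenize):
        # Recover character offsets for each token (assumes tokens appear in order).
        spans, cursor = [], 0
        for token in tokenize(text):
            start = text.index(token, cursor)
            spans.append((start, start + len(token)))
            cursor = start + len(token)

        tokens_per_link, boundary_crossings = [], 0
        for link_start, link_end in links:
            covering = [s for s in spans if s[1] > link_start and s[0] < link_end]
            tokens_per_link.append(len(covering))
            # A token crosses a boundary if it overlaps the anchor but extends beyond it.
            boundary_crossings += sum(1 for s in covering if s[0] < link_start or s[1] > link_end)
        return sum(tokens_per_link) / len(links), boundary_crossings

    text = "The Eiffel Tower is in Paris."
    links = [(4, 16), (23, 28)]  # anchors: 'Eiffel Tower', 'Paris'
    print(evaluate_word_tokenizer(text, links, str.split))
    # (1.5, 1): 'Paris.' keeps its trailing period and so crosses the second link's boundary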

References

  1. Deckelmann, S. (2023, July 12). Wikipedia’s value in the age of generative AI. Down the Rabbit Hole (Medium). https://medium.com/freely-sharing-the-sum-of-all-knowledge/wikipedias-value-in-the-age-of-generative-ai-b19fec06bbee
  2. Park, C. Y., Yan, X., Field, A., & Tsvetkov, Y. (2021). Multilingual Contextual Affective Analysis of LGBT People Portrayals in Wikipedia. Proceedings of the International AAAI Conference on Web and Social Media, 15, 479–490. https://doi.org/10.1609/icwsm.v15i1.18077
  3. Redi, M., Fetahu, B., Morgan, J., & Taraborelli, D. (2019). Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia’s Verifiability. The World Wide Web Conference, 1567–1578. https://doi.org/10.1145/3308558.3313618
  4. Johnson, I., & Lescak, E. (2022). Considerations for Multilingual Wikipedia Research. In arXiv [cs.CY]. arXiv. http://arxiv.org/abs/2204.02483