Research:NLP Tools for Wikimedia Content

Tracked in Phabricator:
Task T316941
20:07, 13 October 2022 (UTC)
Duration:  2022-09 – 2023-06

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

Many research projects use Wikimedia textual content as data: for example, training language models on Wikipedia articles, studying how an individual's gender identity affects how they are written about, or finding sentences in need of citations for tools. How Wikimedia content is preprocessed for research and development can substantially affect the quality and accuracy of the final research product, but preprocessing can be difficult, especially when researchers are working with unfamiliar languages or are not experts in wikitext syntax and structure. Wikimedia content generally requires special preprocessing to remove syntax and isolate the actual text of articles. Even once that text is isolated, challenges remain: Wikipedia, for example, exists in over 300 languages, many more than most open-source NLP tools support. Even for better-resourced languages, Wikimedia content is written in a specific style, so general open-source NLP tools are not necessarily a good match for processing it. As a result, there are no good shared standards or open-source libraries for processing Wikimedia textual content.

The vision of this project is:

  • Researchers could start with a Wikipedia article (wikitext or HTML), strip the syntax to leave just paragraphs of plaintext, and then tokenize these paragraphs into sentences and words for input into models.
  • This would be language-agnostic – i.e. the library would work equally well regardless of Wikipedia language.
  • Each component would be a Python library that is easily configurable but provides good default performance out-of-the-box.
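
The steps in this vision can be sketched in a few lines of Python. This is a minimal illustration only, not the project's libraries: the function names (strip_wikitext, split_sentences, split_words) are hypothetical, the regular expressions handle only simple wikitext (no nested templates, tables, or files), and the tokenizers are naive rules that are far from language-agnostic.

```python
import re

# Hypothetical helpers for illustration; these are NOT the API of any
# existing Wikimedia library and only cover simple cases.


def strip_wikitext(wikitext):
    """Rough wikitext-to-plaintext conversion for simple markup."""
    # Drop non-nested templates such as {{citation needed}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)
    # Drop self-closing and paired <ref> tags along with their contents.
    text = re.sub(r"<ref[^>]*/>|<ref[^>]*>.*?</ref>", "", text, flags=re.S)
    # Replace [[target|label]] with label and [[target]] with target.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Remove bold/italic apostrophe markup.
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


def split_sentences(text):
    """Naive split after sentence-final punctuation.

    Latin-style punctuation must be followed by whitespace; a few
    full-width / Devanagari marks may be followed by nothing at all.
    Abbreviations such as "Dr." are NOT handled.
    """
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？।])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def split_words(sentence):
    """Naive word split on Unicode word characters.

    This does not segment languages written without spaces between words.
    """
    return re.findall(r"\w+", sentence)


if __name__ == "__main__":
    wikitext = (
        "'''Paris''' is the capital of [[France]].{{citation needed}} "
        "It lies on the [[Seine|River Seine]]."
    )
    for sentence in split_sentences(strip_wikitext(wikitext)):
        print(sentence, split_words(sentence))
```

Keeping each stage as a separate function mirrors the idea that each component would be its own configurable library; a real implementation would need abbreviation handling, per-script rules, and testing across many languages.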

Current state

There are good existing libraries for converting wikitext to plaintext (edit types library) and converting HTML to plaintext (HTML dumps library). While both can see continued improvement and streamlining, the basic functionality exists. We have explored various approaches for sentence and word tokenization but our code is scattered across projects, still has known gaps, and has not been well-tested in many languages.

Potential Applications

  • Structured Tasks
    • Copy-edit: need to split article into individual sentences to feed into model.
    • Add-a-link: split articles into sentences to capture appropriate context for each word in the model. Split sentences into words to know which tokens to evaluate for links.
    • Citation-needed: need to split article into individual sentences to feed into citation-needed model.
  • Metrics / Analysis
    • Edit types: summarize what was changed by an edit on Wikipedia – e.g., # of sentences/words added.
    • Readability: extract sentences to identify the number of entities per sentence as a proxy for readability.
    • Quality model: number of sentences as a better proxy for amount of content than bytes.
  • Extraction
    • TextExtracts Extension: return first k sentences in an article.
    • HTML Dumps: extract plaintext from article HTML – eventually might be nice to feed output into sentence tokenizer for even more control.
    • Vandalism detection: feature generation would benefit from word tokenization and likely sentence tokenization as well.
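
Two of the applications above (returning the first k sentences, and counting sentences and words added by an edit) can be sketched once tokenizers exist. The example below is a hedged illustration only: the function names are hypothetical, the tokenizers are naive regular-expression stand-ins rather than the project's tokenizers, and "added" is approximated as a multiset difference between the old and new text, which is an assumption and not how the edit types library works.

```python
import re
from collections import Counter

# Naive stand-in tokenizers (assumptions for illustration only).


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def split_words(text):
    """Split on runs of Unicode word characters."""
    return re.findall(r"\w+", text)


def first_k_sentences(text, k):
    """Return the first k sentences of text, joined by single spaces."""
    return " ".join(split_sentences(text)[:k])


def count_added(old_text, new_text):
    """Count sentences and words in new_text that are not in old_text.

    Uses a multiset (Counter) difference, so a modified sentence counts
    as one added sentence, and word comparison is case-sensitive.
    """
    added_sentences = Counter(split_sentences(new_text)) - Counter(split_sentences(old_text))
    added_words = Counter(split_words(new_text)) - Counter(split_words(old_text))
    return {
        "sentences_added": sum(added_sentences.values()),
        "words_added": sum(added_words.values()),
    }


if __name__ == "__main__":
    old = "The cat sat. It was warm."
    new = "The cat sat. It was warm. Then it slept."
    print(first_k_sentences(new, 2))
    print(count_added(old, new))
```

Both functions depend entirely on the quality of the underlying tokenizers, which is why shared, well-tested sentence and word tokenization would benefit all of these applications.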


Sentence Tokenization

Word Tokenization