
WikiCite 2018/Program/Building a WikiCite corpus



In WikiCite contexts, a corpus is a set of Wikidata entries that share some common characteristics, for instance having the same author, translator, language or topic, being cited from the same Wikipedia page, or having been published within a particular geographic region or a specific period. In this talk, I will explore some examples of such corpora that have been or are being assembled on Wikidata and highlight how they can be used to improve data quality, data models, tools and workflows, or simply to gain a deeper understanding of the relationships between elements of the corpus. These examples include corpora under the auspices of the WikiProjects Wikipedia Sources, Retractions, Invasive Species and Kākāpō, as well as the Zika Corpus and others.
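To make the definition concrete, here is a minimal sketch of how a corpus defined by one shared characteristic could be expressed as a SPARQL query for the Wikidata Query Service. The helper `corpus_query` and the `CORPUS_PROPERTIES` mapping are hypothetical names introduced for illustration; the property IDs are the usual Wikidata properties for author, translator, language and main subject. The code only builds the query text and does not send it anywhere.

```python
import re

# Wikidata properties for some of the characteristics named above.
# The mapping and the helper below are illustrative assumptions.
CORPUS_PROPERTIES = {
    "author": "P50",
    "translator": "P655",
    "language": "P407",  # language of work or name
    "topic": "P921",     # main subject
}


def corpus_query(characteristic: str, value_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query selecting items that share one characteristic."""
    if not re.fullmatch(r"Q[1-9]\d*", value_qid):
        raise ValueError(f"not a Wikidata item ID: {value_qid!r}")
    prop = CORPUS_PROPERTIES[characteristic]
    return (
        "SELECT ?work ?workLabel WHERE {\n"
        f"  ?work wdt:{prop} wd:{value_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {limit}"
    )


# Example: works whose author is Q42 (Douglas Adams).
print(corpus_query("author", "Q42"))
```

The resulting text could be pasted into the query service; combining several characteristics would mean adding further triple patterns to the same query.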


Defining a WikiCite corpus

Multiple approaches are possible here; I am just illustrating some.

Primary corpora

  • things that have been
    • authored
    • published
    • cited
    • archived
    • used as a reference on Wikimedia platforms

Secondary corpora

What to consider before getting started on a new corpus

  • What is already there?
    • items
    • properties
    • What about lexemes, forms and senses?
  • How is it modeled?
  • What is the purpose of the existing and new corpora?
    • discovery
      • e.g. of knowledge, connections, potential collaborators
    • quality control
      • might involve
        • constraint statements
        • Shape Expressions
        • maintenance queries
          • for constraints, benchmarks etc. (some examples)
        • Scholia
          • see also next talk
    • research assessment
  • What about starting your corpus as a subset of one of the existing ones?


  • How does it relate to the past, present and future of WikiCite?
  • create new properties
  • revise data models
  • write Shape Expressions
  • build or adapt tools and workflows
  • SourceMD
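
As a rough illustration of the quality-control item above, a maintenance check can be as simple as listing which corpus items lack statements that the data model expects. The sketch below runs on a local dictionary rather than on Wikidata; the function `missing_statements`, the choice of required properties and the placeholder item IDs are all assumptions made for the example, not part of any existing tool.

```python
from typing import Dict, List

# Assumed minimum for a bibliographic item: title, publication date, author.
REQUIRED = {"P1476": "title", "P577": "publication date", "P50": "author"}


def missing_statements(
    items: Dict[str, Dict[str, list]],
    required: Dict[str, str] = REQUIRED,
) -> Dict[str, List[str]]:
    """Map each item ID to the required properties it has no values for."""
    report = {}
    for qid, claims in items.items():
        # A property counts as missing if it is absent or has an empty value list.
        gaps = [pid for pid in required if not claims.get(pid)]
        if gaps:
            report[qid] = gaps
    return report


# Placeholder data; the IDs and values are not real Wikidata items.
sample = {
    "Q_EXAMPLE_1": {
        "P1476": ["An example title"],
        "P577": ["2018-11-27"],
        "P50": ["Q_AUTHOR_1"],
    },
    "Q_EXAMPLE_2": {"P1476": ["Another example"], "P50": []},
}
print(missing_statements(sample))
```

On Wikidata itself the same idea would more likely be expressed as a constraint statement, a Shape Expression or a SPARQL maintenance query; the local version only shows the logic.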

Additional considerations

  • Complete corpora are good for testing purposes, so watch out for
    • things that do not change (much any more), e.g.
      • all citations from a given version of a publication
      • publishers, journals, organizations, authors, countries etc. that do not exist any more
      • publications in extinct languages
  • Scholia for quality control
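
One way a complete, stable corpus might serve testing, sketched under the assumption that the corpus is tracked as a set of item IDs: record a fingerprint of its membership once, then compare later snapshots against it to notice unexpected additions or removals. The function name `corpus_fingerprint` and the placeholder IDs are hypothetical.

```python
import hashlib
from typing import Iterable


def corpus_fingerprint(qids: Iterable[str]) -> str:
    """Return an order-independent SHA-256 digest of a set of item IDs."""
    # Deduplicate and sort so the digest depends only on membership.
    canonical = "\n".join(sorted(set(qids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder IDs standing in for a frozen corpus and a later snapshot.
baseline = corpus_fingerprint(["Q_A", "Q_B", "Q_C"])
later = corpus_fingerprint(["Q_C", "Q_A", "Q_B", "Q_A"])
print(baseline == later)
```

A changed digest only says that membership differs; finding out which items changed would still require comparing the two sets directly.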



Daniel Mietchen is trained as a biophysicist and now works for the Data Science Institute of the University of Virginia on opening up research and education workflows for large-scale collaboration, including with machines. More details via Scholia or this user page.