Research:University of Virginia/sort a billion documents

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


The Internet Archive is a digital library of websites, other digitally native media, and traditional media in digital form. Since its establishment in 1996, it has become one of the most popular web destinations. Its nonprofit mission includes providing as many library services as possible to as many people around the world as possible.

Problem

A simple, imprecise automated process has selected one billion documents from the Internet Archive's collection that appear to be about scholarly research of some kind, whether in the sciences, humanities, medicine, business, or any other discipline. Some of these documents have metadata, and many have a digital text transcription. The lack of standardized data across items in this collection is a barrier to further analysis and information discovery. These documents will need sorting many times and in many ways, but for now, the task is to sort them into these groups:

  1. This document is scholarly research, including
    1. academic papers from journals
    2. white papers
    3. preprints or unpublished drafts of research
  2. This document talks about research or technical topics, but is not scholarly
    1. casual essays
    2. high school student research reports
    3. journalism
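
As an illustration only, the two groups and their example subtypes could be written down as a small lookup table so that any labeling tool uses the same vocabulary. This is a minimal sketch: the names `Group`, `SUBTYPE_TO_GROUP`, and `group_for` are hypothetical and are not part of the project.

```python
from enum import Enum


class Group(Enum):
    """The two top-level groups named in the problem statement."""
    SCHOLARLY = "scholarly research"
    NOT_SCHOLARLY = "about research or technical topics, but not scholarly"


# Subtypes are taken from the list above; the string keys are
# hypothetical identifiers chosen for this sketch.
SUBTYPE_TO_GROUP = {
    "journal_paper": Group.SCHOLARLY,
    "white_paper": Group.SCHOLARLY,
    "preprint_or_draft": Group.SCHOLARLY,
    "casual_essay": Group.NOT_SCHOLARLY,
    "high_school_report": Group.NOT_SCHOLARLY,
    "journalism": Group.NOT_SCHOLARLY,
}


def group_for(subtype: str) -> Group:
    """Return the top-level group for a subtype label.

    Raises ValueError for labels that are not in the table, so that
    unexpected labels are noticed instead of being silently dropped.
    """
    try:
        return SUBTYPE_TO_GROUP[subtype]
    except KeyError:
        raise ValueError(f"unknown subtype: {subtype!r}") from None
```

For example, `group_for("journalism")` returns `Group.NOT_SCHOLARLY`. The subtypes listed are examples, so a real label set would probably need more entries.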

Objective

Accomplish the sorting in any way that seems appropriate. Subobjectives could include creating a database of metadata, splitting the collection into categories to sort subsets in different ways, and compiling appropriate test datasets to use for data modeling.
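
One of the subobjectives, compiling a test dataset, could start with a uniform random sample of document identifiers to label by hand. The sketch below uses reservoir sampling (Algorithm R), which draws a fixed-size sample in one pass without holding the whole collection in memory. It is an assumption about how sampling might be done, not the project's chosen method, and the function name `reservoir_sample` and the `doc-N` identifiers are hypothetical.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(items: Iterable[str], k: int,
                     seed: Optional[int] = None) -> List[str]:
    """Return up to k items chosen uniformly at random from a stream.

    The stream is read once and only k items are kept in memory, so
    this works even when the full list of identifiers is very large.
    """
    rng = random.Random(seed)  # a seed makes the sample reproducible
    sample: List[str] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces a kept
            # item only if the slot falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# Usage with hypothetical identifiers standing in for real ones.
identifiers = (f"doc-{n}" for n in range(100_000))
to_label = reservoir_sample(identifiers, k=500, seed=2020)
```

A uniform sample reflects the collection as it is, so rare document types may be missing from it. Sampling within categories would be one way to address that once the collection has been split.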

Timeline

Late September 2020: Proposal presentation
May 2021: Project ends

Data

Deliverables

  1. Research Proposal
  2. Data Product
  3. Technical Paper
  4. Research Poster
  5. Slides
  6. Presentation of research
  7. Video presentation?
  8. Essay on ethics?
  9. Method documentation?

Research Team

Contact

  • For general Wikimedia questions please contact Lane Rasberry, Wikimedian at the University of Virginia, rasberry virginia edu, user:bluerasberry