Research:University of Virginia/sort a billion documents

Contact

???, University of Virginia, researcher

???, University of Virginia, faculty advisor

???, archive, advisor

Raf Alvarado, University of Virginia, faculty advisor

Lane Rasberry, University of Virginia, Wikimedian


This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

The Internet Archive is a digital library of websites, other born-digital media, and traditional media in digitized form. Established in 1996, it has been one of the most popular destinations on the web. Its nonprofit mission includes providing as many library services as possible to as many people as possible worldwide.

Problem

A simple, imprecise automated process has selected one billion documents from the Internet Archive's collection that appear to concern scholarly research of some kind, whether in the sciences, humanities, medicine, business, or any other discipline. Some of these documents have metadata, and many have a digital text transcription. The lack of standardized data across items in the collection is a barrier to further analysis and information discovery. The documents will need to be sorted many times and in many ways, but the first task is to sort them into these groups:

This document is scholarly research, including:
1. academic papers from journals
2. white papers
3. preprints or unpublished drafts of research

This document talks about research or technical topics but is not scholarly, including:
1. casual essays
2. high school student research reports
3. journalism
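
The two groups above amount to a binary text classification task. As a starting point only, here is a minimal keyword-based baseline in Python. The signal patterns, the threshold, and the names `SCHOLARLY_SIGNALS`, `scholarly_score`, and `classify` are all assumptions made for illustration and are not part of the project. A real approach would likely need labeled examples and a trained model.

```python
import re

# Hypothetical signals chosen for illustration only; they are not
# taken from the project and would need tuning on real documents.
SCHOLARLY_SIGNALS = [
    r"\babstract\b",
    r"\breferences\b",
    r"\bdoi:\s*10\.\d{4,9}/",
    r"\bet al\.",
    r"\bpreprint\b",
    r"\bwhite paper\b",
]


def scholarly_score(text: str) -> int:
    """Count how many distinct signal patterns appear in the text."""
    lowered = text.lower()
    return sum(1 for pattern in SCHOLARLY_SIGNALS if re.search(pattern, lowered))


def classify(text: str, threshold: int = 2) -> str:
    """Label a document 'scholarly' if it shows at least `threshold` signals."""
    return "scholarly" if scholarly_score(text) >= threshold else "not_scholarly"
```

A baseline like this will mislabel many documents (journalism that quotes a DOI, for example), but it gives later models a reference point to compare against.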

Objective

Accomplish the sorting in any way that seems appropriate. Subobjectives could include creating a database of metadata, splitting the collection into categories to sort subsets in different ways, and compiling appropriate test datasets to use for data modeling.
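
Two of the subobjectives, a metadata database and reproducible test datasets, can be sketched with the Python standard library. The table layout, the column names, and the helpers `create_db`, `add_document`, and `split_of` below are hypothetical and shown only to make the idea concrete. At the scale of a billion documents a different storage system would probably be needed.

```python
import hashlib
import sqlite3


def create_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create a small metadata table (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               identifier TEXT PRIMARY KEY,
               title      TEXT,
               has_text   INTEGER NOT NULL DEFAULT 0,
               label      TEXT
           )"""
    )
    return conn


def add_document(conn, identifier, title=None, has_text=False, label=None):
    """Insert or overwrite one document record."""
    conn.execute(
        "INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?)",
        (identifier, title, int(has_text), label),
    )


def split_of(identifier: str, test_fraction: float = 0.2) -> str:
    """Assign a document to 'train' or 'test' by hashing its identifier,
    so the assignment stays the same across runs and machines."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"
```

Hashing the identifier, rather than drawing random numbers, means a document never moves between the training and test sets when the collection is re-processed.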

Timeline

Late September 2020: Proposal presentation

May 2021: Project ends

Data

https://archive.org/

Deliverables

Research Proposal
Data Product
Technical Paper
Research Poster
Slides
Presentation of research
Video presentation?
Essay on ethics?
Method documentation?

Research Team

Contact

For general Wikimedia questions, please contact Lane Rasberry, Wikimedian at the University of Virginia: rasberry virginia edu, user:bluerasberry