Copyright expiry acceleration is a set of projects focused on expanding and strengthening the public domain (PD) by highlighting the compounding public benefit of copyright expiry, speeding the process, and removing ambiguity around works that are likely PD but held back by uncertainty. As of 2023, the US Library of Congress had digitized both the catalog of copyright entries and many of the more detailed copyright record books, facilitating this process.

Catalog of un-renewed PD works: US©RD + NRWB

There are many US works published before 1964 that should be PD-US-NR (public domain in the US because the copyright was not renewed) but have yet to be identified as such. A number of them (and of orphaned works generally) are available via fly-by-night print-on-demand (POD) services on sites like Amazon, but are neither recognized as PD nor available in digital library catalogs.

The catalog of copyright entries (CCE) was fully digitized by the Internet Archive (IA) in 2012, and the OCR was cleaned up by NYPL and DCLab in 2019. Sean Redmond published an essay highlighting the ups and downs of determining PD status, and the sort of checklist needed to sift out the likely candidates. While this has been done for some sample collections, the full catalogs (~550k renewals, ~500k registrations with no matching renewal, across the Books and Pamphlets classes) have not been mapped to library holdings, and there is no catalog of name-normalized PD works with links to PD manifestations, scans, or ebooks.

Elements that exist as of 2021:

a well-documented GitHub repository for the NYPL project
an unofficial search engine for that metadata: CCE-Search
a spreadsheet of likely-non-renewed works, derived from this (original name & author fields; not normalized / no linked IDs): FINAL-not-renewed.tsv
a first attempt to match these works against the Internet Archive catalog

Target

A public Wikibase [NRWB] w/ data on PD works, a catalog of resulting digital books, & insurance for catalog users.

integrated into an epub-generating pipeline, ready to integrate into a future bibliometrics corpus.

with estimates of the value of works entering the PD, and associated risks, for use in policy reform.

Goals

A dashboard showing

# of renewals not yet matched to a registration
# of registrations not renewed
likely-PD works, w/ confidence score
- likely-PD manifestations, w/ confidence score. (digitized Y/N, inserts reviewed Y/N, ebook Y/N)
focused subsets by class: serials/periodicals, books, pamphlets, plays, music, ...
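
The dashboard tallies above could be computed along these lines. This is a minimal sketch, not the project's implementation: the `Registration` record, its field names, and the 0.9 threshold are all hypothetical placeholders for whatever the normalized CCE data ends up looking like.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Registration:
    cce_id: str                 # hypothetical normalized CCE-ID
    work_class: str             # e.g. "books", "pamphlets", "serials"
    renewal_id: Optional[str]   # None when no matching renewal was found
    pd_confidence: float        # hypothetical score between 0.0 and 1.0


def dashboard_counts(registrations, renewal_ids, pd_threshold=0.9):
    """Tally the headline numbers a curator dashboard might show."""
    # Renewal IDs that some registration already points to.
    matched = {r.renewal_id for r in registrations if r.renewal_id is not None}
    not_renewed = [r for r in registrations if r.renewal_id is None]
    # "Likely PD" here means: not renewed AND above the assumed threshold.
    likely_pd = [r for r in not_renewed if r.pd_confidence >= pd_threshold]
    return {
        "renewals_unmatched": len(set(renewal_ids) - matched),
        "registrations_not_renewed": len(not_renewed),
        "likely_pd": len(likely_pd),
        "likely_pd_by_class": dict(Counter(r.work_class for r in likely_pd)),
    }
```

In a Wikibase-backed setup the same counts would more plausibly come from queries against the query service; the function only illustrates which quantities are being counted.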

A public catalog of PD-not-renewed works (w/ rough confidence and significance scores)

w/ a way to add PD covers + format-cleaned editions

Collective insurance for reusers (for the subset w/ non-zero uncertainty)

Estimates of the value of free public reuse & the risks of legal damages. [rendered !moot if LOC ever hosts a subset of the catalog]

Rough roadmap

A first pass. What's missing? What's already underway?

Convert CCE scans into tsv's (two parts: books + pamphlets [CCE-1], other classes [CCE-2]). (80% of CCE-1 is done; the rest needs doing)
1. These should link renewals to their registrations, and give each entry a normalized CCE-ID. (original reg. IDs are not unique)
Normalize names of works + authors, link to library records. [CCE-3]
1. For clarity, create a new tiny catalog for works w/ no known library record [CCE-cat]
Build a wikibase [NRWB] to store these + allow parallel projects to enrich CCE-3 w/ other facets (major ones in bold below).
1. Build a dashboard (via NRWB query service?) for curators to see what remains, by category
Evaluate PD status, digitization status, + significance of each work in CCE-3; assign confidence scores. [CCE-PD]
1. PD-confidence: strength of closest match in "unclear" renewals; likelihood of a previous foreign publication, etc. Walk through this checklist.
2. work-significance: popularity, novelty, length, existence in any known catalog
3. digitization status: if the work has multiple manifestations, note the latest that is PD + status of each. Steps should include OCR, text cleaning, review + removal of images + other inserts, ebook formatting (compare PGDP checklist)
For each work in CCE-PD, decide if it is ready for review; if so flag as PD-NR
1. Choose threshold of confidence + significance to review, based on capacity. Tune for low false positives.
2. For each entry: link to existing scan + status, or prioritize in scan queue.
3. Add to cleanup queues for scripted human review. (check capacity + interest: Wikisource, PGDP)
Add PD-NR to pipelines for design + distribution [books: PDBC; other works: ??]
1. Link w/ PD cover art + compilation. Add to existing DPLA search index + catalog (see their curation-corps)
Release and publicize
1. Share top-confidence works as PD-NR. (Dover crossover collab? Cross-visibility w/ DPLA's releases of old awesome gov docs?)
2. Review remaining likely-PD works, estimate social value + get collective insurance, share as PD-NR*
3. Release + promote PDBC as part of public domain day
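
The normalization, matching, and thresholding steps in this roadmap can be sketched roughly as follows. Everything here is an assumption for illustration: the `title` and `author` dictionary keys, the use of Python's standard `difflib` as a stand-in for a real fuzzy matcher, and the 0.5 threshold. A real PD-confidence score would also weigh the other checklist factors (such as prior foreign publication), which this sketch ignores.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def similarity(a, b):
    """String similarity in [0, 1] on normalized forms (difflib ratio)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def closest_renewal(registration, renewals):
    """Return (score, renewal) for the strongest candidate renewal match."""
    best_score, best = 0.0, None
    for ren in renewals:
        # Require both title and author to agree: take the weaker of the two.
        score = min(
            similarity(registration["title"], ren["title"]),
            similarity(registration["author"], ren["author"]),
        )
        if score > best_score:
            best_score, best = score, ren
    return best_score, best


def pd_confidence(registration, renewals):
    """Simplified score: the stronger the closest renewal match, the lower
    the confidence that the work was never renewed."""
    score, _ = closest_renewal(registration, renewals)
    return 1.0 - score


def review_queue(registrations, renewals, threshold=0.5):
    """Registrations whose simplified PD-confidence clears the threshold."""
    return [r for r in registrations if pd_confidence(r, renewals) >= threshold]
```

Raising `threshold` trades coverage for fewer false positives, which is the tuning the roadmap asks for. Comparing every registration against every renewal is quadratic, so a full run over the catalog would need blocking (for example by year or author surname) before any pairwise comparison.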

Needs

OCR : Automated OCR was too noisy. Structured, clean OCR is on hiatus; 2/3 of the pages still need this. (NYPL and LOC both have plans on this front)

Law : Clearance of national + international rights; analysis of risk + sufficiency of curation/review of final PDBC works

People : Wikidata scripting; data cleaning; epub-generation scripting

Data + tools : OR scripts; PD cover generator for epubs

Time : natural milestones @ annual PD Days

Potential participants

Past efforts, winding down: Stanford CRD. Free book covers on WP. Recovering the Classics.

Past OCR efforts: Google generated their own in 2011; tools have improved since then.

Ongoing legal efforts: Durationator, Hathi research, case.law parallels (and interested law students)

Current efforts: NYPL OCR, code + data. UPenn serials. LOC digitization. Hathi manual reviews + catalog. DPLA ebook catalog. CS&S support.

Current WM work: Wikidata (UPenn), Wikisource?, Commons (PD-US-NR category; ~40k images on commons; PD Wikiproject)

Notes

Past + present work

This work depends on, builds upon, and extends the work of many researchers, scholars, and practitioners. The projects described below have made significant advances in extracting or using data from the CCE, producing public datasets that furthered the shared work.

Scans and transcriptions of book renewals

The collaboration between Project Gutenberg and Distributed Proofreaders (PGDP) was one of the first efforts to transcribe the CCE. PGDP volunteers transcribed the renewal records for books and published the transcription online, but did not tackle records from other classes.

The Stanford Copyright Renewal Database (SCRD) allows users to search the CCE renewal records for books. The data comes primarily from the output of the PGDP project, but with some important advances. The SCRD project parsed the raw transcription from the PGDP project and created well-formed records. Parsing the data into the appropriate fields allows users of the SCRD to search specific fields and facet results.

Copyright analysis and rights clearance

The proposed work complements the University of Michigan’s Copyright Review Management System (CRMS), now operated by HathiTrust. CRMS is the first successful attempt to determine the copyright status of books at scale using data from the CCE. CRMS relies on the SCRD to let reviewers check for the presence or absence of copyright renewals for books that were digitized primarily by Google as part of the Google Books project. CRMS does not determine the copyright status of books that are not yet in the HathiTrust corpus.

Instead of reactively determining the copyright status of books, the copyright-expiry work we propose will help libraries to identify the books that are likely to be in the public domain. This would allow libraries to focus their resources on digitizing books that can be made widely available.

The Durationator project, originating from work at Tulane University, has published evaluations of copyright status and estimates of when copyright expires for a subset of works.

Scans and transcription of the full CCE

In 2011, Google scanned the CCE and made a searchable version available online, but the scans and OCR were not available as public domain works...

So in 2012, the US Copyright Office contracted with the Internet Archive (IA) to photograph the CCE and made it available to the public. IA also used OCR to generate a machine-readable transcription, but its accuracy was not sufficient for detailed copyright research, and there was no reliable way to search across all 674 digitized volumes at once. As a result, IA’s digitized CCE was largely used much like the analog version: by 'turning' the electronic pages to find the records sought.

Since 2018, the New York Public Library (NYPL) has worked to create a highly accurate transcription and parsing of the CCE, with funding from the Institute of Museum and Library Services, the Ford Foundation, and the Arcadia Fund. To date, NYPL has transcribed and parsed the registration records for Class A (books) from 1923 to 1977, representing more than 88,000 pages. NYPL has also reviewed an additional 41,000 pages and determined that those pages do not need to be transcribed or parsed because they are duplicative or were previously converted by the PGDP project.

Work on subsets of the CCE other than books

The University of Pennsylvania Libraries is also working to document the renewal records for serial publications (see the related Wikidata pilot project). UPenn’s project interprets data from the CCE to produce a list of dates on which the first renewal could be found, while the proposed project will produce transcribed and parsed data that others will be able to use.
