Proposal for an Encyclopedian Recycling Endeavor

Today, August 26th (2001, apparently), marks the completion of a distributed endeavor to copy into Wikipedia the articles from a 1911 encyclopedia that someone had thoughtfully digitized and placed into the Project Gutenberg archives. This endeavor had been suggested by someone back when Wikipedia first started, and I had taken the initiative to get the first 20 or so items posted (starting, of course, with the letter 'A'). 20 items was just a drop in the bucket though; check out Alan Millar/Status to see them all. Thankfully folks like Alan, with lots more endurance than I, were available to continue posting these articles, and today I stand amazed that we completed it! Even though it was only one volume, that's still a lot of articles!

In my opinion, these articles greatly augment Wikipedia with necessary data that is unlikely to just "happen" to be entered by visitors. Consider, for example, Alphonso_X_of_Spain, a medieval Spanish king. Certainly worthy of mention, both as a world leader and because of his early involvement in astronomy. But would an entry on this fellow just happen to show up through normal Wikipedia processes? Maybe, but probably not. Needless to say there are thousands of moderately important people like Alphonso who *should* be listed in Wikipedia, yet most likely *won't*; at least not anytime soon.

I don't mean to disparage Wikipedia; quite the contrary. Wikipedia has a number of strengths that the proprietary encyclopedias will likely *never* have. Of these many strengths, let me choose just one for elaboration: Timeliness. Let me digress a bit with a little example.

This week a new planetoid-thingee was discovered out in the comet belt. Very important scientific discovery, but let's say within a few weeks your daughter needs to write a report on it for high school, and needs more in-depth info than is available in those terse CNN news items. That hard-bound dead-tree encyclopedia might have some useful articles on asteroids and the solar system, and probably will only be a few years out of date, but it certainly won't have anything useful on this newly discovered planetoid. Fortunately you bought your daughter a new computer today and it came with a digital CD ROM encyclopedia. Unfortunately, due to space constraints, this encyclopedia's asteroid article is extremely terse (though it does have a photo of an asteroid, but it's copyrighted with full legal protections of course). And since the CD ROM was published months before the new planetoid thingee was discovered, you're not likely to find it there either. You decide to try an online encyclopedia, yet these appear to be just online versions of the CD ROM you bought; maybe the pay-to-view site can afford to stay more up to date, but you've paid for two encyclopedias, and aren't too thrilled about paying for another. Surely there must be another option...

Your daughter giggles at you. "Silly, just go to Wikipedia!" You do so, and she looks at the Recent Changes page to see if anyone's been keeping up to date with the news. Sure enough: Just today (the 26th of August, only a few days after the discovery), folks have been busily posting away on all manner of astronomical topics. Asteroid is rather terse, but at least includes mention of that planetoid thingee, "asteriod found in 2001, identified as 2001 KX76". Asteroid/Talk includes some extra interesting info not yet incorporated into the main article. Trans-Neptunian object, Kuiper Belt, Planet, Near-Earth asteroid, Solar system, and Planet X have also been updated (or newly added). Clicking around, she's able to find much more information on astronomers, other recent discoveries, and historical information to fill in her report. She's even able to make contact with some other students and teachers interested in this newly found body, and thereby learn of further sources of information on it, available through the web. When she is finally done and turns in her report, she decides to also post it up to Wikipedia, as article 2001 KX76. :-)

Wonderful! Wikipedia comes to the rescue and serves its role in the passing of knowledge to those who need it.

But you notice two things that are kind of odd. First, of course, there are very few photos in Wikipedia, but that's a whole 'nother topic of discussion. At least there's a photo of Galileo on his page. Second, and perhaps more importantly, many of the supporting articles seem to be rather terse. For example, you compare your proprietary encyclopedia's lengthy dissertation on Pluto with Wikipedia's dinky Pluto entry. Saturn is not much better. Charon doesn't even exist... Hmm...

Timeliness may be a strength of wikipedia, but depth may be its weakness. Certainly we can expect better articles to come from the planets; after all they're big and will always be there and new things are likely to be discovered about them. But what about other, older topics? Say you needed to know about the origins of astronomy in the 13th century. Luckily there's that aforementioned article on Alphonso X of Spain, but what of historical figures with names starting with B-Z?

The 'A' encyclopedia was digitized by hand, by someone who happened to have a 1911 edition on hand. Digitization is a lot of work, but it can be done; Project Gutenberg's been at it for years, and when you think about it, they're not so much different, organizationally, than we are.

So here is my proposal.

I think we could turn our distributed, collaborative talents and processes towards a mini-Project Gutenberg endeavor to digitize and copy into Wikipedia a full set of out-of-copyright encyclopedias.

I think in the interests of practicality and to make distribution of efforts a bit simpler, we may want to allow variation in years... We could have the 1911 A, 1922 B, 1909 C, etc. I think if we allow this, it eliminates a lot of need for coordination. There will of course be variability in where one volume stops and another starts, and we might have a few articles slip through, but if those articles are important, we may "pick them up" through usual Wikipedia evolution.

There are several steps we'd need to take:

a) First, we would need to determine a rough cut-off date for copyrights. Is 1911 the only year open to us, or could a 1920 or 1930 edition be used?

b) Next, we need folks to keep an eye out, as they go about their lives, for some old encyclopedia sets. Look in grandparents' closets and bookshelves, a dusty corner office in an old college, book-heavy garage sales, and used bookstores. The set needn't be complete, but it's important that the print quality be good enough to scan. See if you can buy or have one or more of the books.

c) Now, even if an encyclopedia is in the right date range, we still need to pause and verify that the particular edition in hand *is* in the public domain. Copyright law can be complicated, and Wikipedia *certainly* doesn't want to risk doing anything that could invite a lawsuit from a jealous encyclopedia company some day.

d) I think the easiest and fastest way to scan an encyclopedia volume is also rather destructive: Tear off the cover, break the binding, and cut the pages into loose-leaf, then run them through a scanner. I bet a multi-page feeding scanner would letcha get through a bunch of volumes at once. I suppose one could justify the ruining of an antique book in the knowledge that it's probably near the end of its life in paper form anyway and is bound for a new and even more meaningful life in an electronic form. Note that by leveraging the US Postal System (or FedEx), this step need not necessarily be done by the same fellow who did step b. :-)

e) Next is the hard part: Proofreading. But maybe this step could be skipped, if the scanner is good enough. I've noticed that with the volume 'A' articles, spelling and format correction is quick to occur when the article appears in wikipedia. So maybe this step could be just a quick QA to ensure the page isn't garbled and in need of re-scanning.

f) With the article digitized, the next step is to get it into wikipedia. This is the step we already know how to do very well, so nothing more need be said. Judging by how quickly articles have been submitted lately, I'm guessing someone has developed a tool or process we could reuse.

Steps b-f can proceed in parallel; someone in New Jersey could be working on volume 9 of a 1919 encyclopedia, while someone else does volume 21 of a 1912 edition. Coordination can be done peer-to-peer, as folks ask who is working on what, and can see from what's *not* in wikipedia what still needs to be done.

BryceHarrington

I agree that this is a very important project, and should definitely be done. I am pretty sure that 1911 is the newest public domain version of EB; it is also a very good one (I believe EB was sold to the US afterwards and for a while not much material was added), so sticking with it would be fine.

I think that if we indeed get the scanning project going, we should also donate it to Project Gutenberg, since they gave us volume 1. This means we release it into the public domain, I guess.

However, I'm wondering if Project Gutenberg is scanning the other volumes right now. Does anybody know who scanned volume 1? Those people should be contacted. --AxelBoldt

The rest of the 1911 Encyclopedia Britannica is available on CD-ROM as image files. See http://www.classiceb.com/ So the physical scanning part is already done (no destruction of books required :-), and all that remains is the OCR. The files could be then given to Project Gutenberg and also used here. --Alan Millar

May I suggest we try doing something the "Christian Classics Ethereal Library" has done -- you set up a website with the scanned in image of each page, and a text-area for people to transcribe it into. Or if you OCR it in, the OCR won't always be perfect, so you can set this up so people can correct the OCR'd text from the original image.

The problem with using Wikipedia to correct things is that it is unlikely to result in an exact transcription of EB, which is what Project Gutenberg would want.

Finally, I might note that a lot of texts that are public domain (such as EB1911) in the US may still be copyrighted elsewhere, due to past differences in copyright law -- but then so long as Wikipedia is located on a US server, that shouldn't be a problem.

Also, http://www.classiceb.com claims their scanned-in images are copyrighted. As per Public Domain Resources/Talk, they are (probably) wrong -- mere scanning is not of sufficient novelty to create copyright. But they still might cause legal hassles -- so maybe we better just scan it in ourselves.

-- Simon J Kissane

They claim copyright on their images and CDs, but not on the text itself; maybe we should just contact them to ask about OCR. They'll probably allow it but won't allow that we put up their images on a web site for people to grab and OCR. Everybody who wanted to participate in the OCR effort would have to buy their own CD set, at $109 a pop. --AxelBoldt

If you don't want to rip a volume apart you can do what I did reasonably successfully with a lot of the Catholic Encyclopedia articles I scanned - photocopy the page, then scan it, OCR it, then proofread it. Of course, I was avoiding working on my dissertation and had free access to copying machines... --MichaelTinkler

Text from: http://www.classiceb.com/faqs.html

However, while the original text of these older editions of the Encyclopedia Britannica are in the public domain, the ClassicEB® CDs are copyrighted by ClassicEB.com in all respects except for the original text. This means that purchasers of any ClassicEB® CDs may not copy our CDs in any way under penalty of violating copyright laws. Users may, of course, print off on paper any or all images contained on the CDs, in unlimited amounts, without violating ClassicEB.com's rights or copyright laws.

If you own your own physical set of the public domain editions of the Encyclopedia Britannica, you may scan your set onto CD and offer them yourself without being in violation of copyright laws. But you may not copy the images contained on the ClassicEB® CDs, nor may you use in any way the manual or any of the index tools contained on the CDs which were designed by ClassicEB.com. Violators of ClassicEB.com's rights will be prosecuted.

They claim it's copyrighted, but I doubt that the images in fact can be copyrighted, since they contain no novelty. (Unlike photographs, there is no creativity in the arrangement, selection of view or lighting.) Just because they say it is copyrighted doesn't mean it is. Though of course, them suing us (even if they lose) wouldn't be fun. -- Simon J Kissane

One other thing they probably can't sue you for is re-encoding their scans in a new format, reorganizing them with your own indexes, and burning your own CD from that. Even if you take the raw data from their scans rather than rescanning the paper yourself, you're probably within the scope of Feist. I had planned to buy their CD anyway, just so I can do some manual edits on what are clearly some bad OCRs of the original. When I get the CD I'll do some back-of-the-envelope estimates on what it would take to recode and reburn it for us. --LDC

One thing nobody has mentioned is that a lot of things have changed since 1911, leading to a lot of misleading ideas in articles scanned directly. A disclaimer at the bottom of the page helps, but knowing that it's an old text doesn't fix everything, and it can be hard to expunge all the wrong data. Articles on physics, for example, would need to be carefully reviewed by somebody fully up-to-date, lest we bring relativity back into debate...

Even things like the article on King Alphonso might be considered apocryphal nowadays - it's my understanding that history gets revised now and again (even by non-revisionists :))

I think this is an issue with all content in the Wikipedia, not just entries from old encyclopedias. It is being discussed significantly in a number of different places on the site here. (Before you can expunge wrong data, you have to define "wrong".) Not to pick on anyone in particular, but I don't perceive any substantial science behind the entry for Pheromones, for example.

But that just goes back to the basic scheme of Wikipedia: It's not Nupedia, and it isn't intended to be. It's a free-for-all, and anyone can fix anything. --Alan Millar

Yesterday, I just found the complete 1911 Encyclopaedia Britannica at my favourite used bookstore, only $180 (Canadian). I don't have the funds for it at the moment (and my scanner is broken), but I'm keeping it in the back of my mind...

About a month ago when in Urbana-Champaign, Illinois, I saw a set for US$ 125.

Working on a project like this gets me seriously annoyed with the current state of copyright law. I have a World Book Encyclopedia set from the 1970s that I can't even give away. The company that makes World Book won't make a cent off of this material, but because of the copyright, we can't use it for Wikipedia; instead we can only reuse material that hasn't been current for almost a century. *sigh* -- STG

I've actually got a paper set of the 9th and 10th edition EB, dating 1870-1900 or so. Many of the articles are so old that they might as well be rewritten from scratch, but some seem worth putting up for editing... the volumes are in too good a condition to contemplate wholesale destruction as exhorted above, so I'll tend to be adding shorter articles typed by hand on random subjects. At the very least, it'll help fill in details on obscure topics (and I was impressed by the speed at which extra information was filled in on the Fifth monarchy men) -- Malcolm Farmer

My opinion is that timeliness is one of our strengths, as those things that interest us get added. I expect that almost every subject will be covered in the next twenty years, and about half of the information will change as well. If we remember that this GREAT project started in January 2001, then compared to that other work, we are accomplishing a great amount. Gutenberg is trying to get Public Domain works into electronic form; we are creating a new work, which I feel is a different and (in my opinion) more satisfying thing. The Old Stuff is a good start, but those of us living have been creating and discovering a great amount of New Stuff, which needs documentation as well.--mike dill

Why don't you (for some value of you that is representative of Wikipedia) simply talk to Project Gutenberg and suggest an affiliation between the two projects? They overlap in certain ways, most especially in certain goals and means. It would be a disaster to fork the effort of yet another project a la Wikipedia-and-FOLDOC. -- SunirShah

It might be more efficient in the long run to rekey the information into wikipedia entries by hand unless the intent is to establish precisely correct and attributed frozen articles for reference. Even in this case much hand cleanup is typically required for OCR.

It would also potentially provide a way for enthusiastic contributors without a lot of specialized knowledge to participate very heavily very quickly. If funding mechanisms get put in place then some funds could be used to buy appropriate out of copyright or public domain works and ship an appropriate volume to a volunteer. Anybody want to tackle the Dead Sea Scrolls or the Iliad?

Just out of curiosity I am going to go check out the Bible..... surely there must be an out of copyright version of the King James around .... Book of Mormon ... Koran in English??? Perhaps these would be better in a controlled environment such as Nupedia intends to provide. A draft from Wikipedia however could save a lot of expensive receptionist time typing. OTOH commercial endeavors generally have revenue to compensate people for work so perhaps we should not be putting typists out of work.

So confusing! I better go browse philosophy for a while. -- user:mirwin

p.s. If this materializes let me know. I will find something interesting and educational to type on a few hours a week when my muse discovers how little I actually know sometimes.