Wikipedia portal workshop
The Wikipedia portal workshop was a one-day event held on June 10, 2009 from 09:00 until 16:30 at Tufts University (Ballou Hall). The agenda was focused on ways to create a portal dedicated to material from the Million Books Project that could be useful to Wikipedia.
- Brian Fuchs
- Dan Cohen
- David Bamman [db]
- David Mimno [dm]
- David Smith [ds]
- Francesco (a student of Greg's working on language trees)
- Greg Crane, head of the Perseus Project
- Josh Greenberg, of the NYPL
- Mathias Schindler [ms]
- Maura Marx, of the Open Knowledge Foundation [mm, convenor]
- Mike Edson
- Peter Brantley, of the Internet Archive [pb]
- Samuel Klein [sj]
- Steve Ramsay
[ A few others in attendance hardly spoke: Francesca from Italy, Marielle, Agnes, Doron W, and Allison from Tufts. ]
Key:
(( … )) – editorial notes from SJ
[ … ] – aside or description of the following section
?? – transcription uncertain
Greg Crane - The public sphere in an electronic age
...consider the difference between private exchange and the state.
http://en.wikipedia.org/wiki/The_Men_Who_Tread_on_the_Tiger's_Tail - why do Togashi's eyes close?
American Experience and references. How many sources are in the 1.5m books/works?
Let people use this powerful medium... as a start for extended research. This is the next point on the trajectory.
100 feet - looking at a place, building, movie. the infantry view.
35,000 feet - we want people to develop ideas fully... Pericles??
From Plato to NATO. Allan Bloom. Classics in the non-European world.
Who was the most important classicist in the 20th c.? Khomeini. His doctoral work was on Aristotle, fundamental to Islamic law. When he organized the republic of Iran, he was reading Plato's Republic, and he set it up as a philosopher-king system, with guardians (the mullahs). You can argue with the holy-city residents who say they are the true heirs of Athenian democracy.
To have this argument, you would start a very important dialogue. And then there is Khatami and the journalists. "Dreams and Shadows" describes a scene about humanity's legacy: Khatami is talking about Plato's justice (and the journalists are bored, waiting for a military/political speech). We tend to have the wrong framework when talking with these [classics-based] cultures, which leads to bad chatter, not meaningful discourse.
So how would you go about studying the impact of Plato in Islamic thought? if you don't know Arabic, Persian, Latin, Greek... [you must do without] the direct source. But given 100k Bibalex and 1M Internet Archive books, you could start to ask this question.
[What unifies cultures?]
Kandahar was one of the Alexandrias. Greek spreads to Pakistan; this unifies these cultures. Likewise the Roman Empire gives rise to the Arabic-speaking world in North Africa and the Middle East. How could you support this? Discuss the cultural continuum from Rabat to Kandahar [at least a dozen modern languages there]. All share similar cultures; how do you think about what is shared?
[1a. How do we share discussion?]
How do we move beyond the English-speaking European community represented here? How do you foster Arabic discussion? "Deifying abstract concepts in Euripides" [shown with Arabic-language notes]. The University of Cairo was in the news with Obama's speech; here is their Classics dept. A single instance of a general question [connecting cultures]: I visited the library at U of Cairo, and they live off of digital libraries. Lots of smart students but not a lot of money. They have Greek and Latin, but their mission and materials are in Arabic, with an emphasis on the scholarship of Alexandria and the Translation Movement from Greek to Syriac to Arabic. Many Greek works survive b/c of translation into Arabic. How do you get this story into English? [And the Arabic -> Latin reverse translations in the Renaissance.]
Tufts is working to establish ways for our students to work with theirs. This depends on access to materials, software, and community-driven environments, so that people who speak Arabic can work with those who speak English.
[1b. what do we need to move forward?]
Answer: collections, analysis and tech, new communities.
Transparency of collections. _DEMOS: Athenian democracy_ -- here is a tenure-book published for free online. The author wrote it with the assumption that all sources existed... and were available online for analysis. You can see all citations, popping up inline if you like. You can call up the linked contents, which point to the Perseus library texts. How does this relate to Wikipedia, where sourcing is a major topic?
Imagine translating this into Arabic; enabling a parallel Persian and Arabic discourse about the same text [and pointing to the same sources].
We want a world where everyone feels they can contribute every day if they are so moved. I think Wikipedia is magnificent. You see changes in academia in practice.
Consider: the Prague Arabic Dependency Treebank.
This provides a language genome. one of the most important new instruments [in language analysis] since print. Let us redefine/establish what we think we know and transform [programmatic] questions. bringing these ideas within reach of undergraduates.
ex: Iliad 6 tree analysis, with 2-3 students working on each sentence, later reviewed by a Greek expert -- new work. Their names are associated with this work.
ex: Homer's best manuscript of the Iliad. A diplomatic edition is done by a class. Most of the scholarship on this text has never been published. Students are now contributing edited notes and translations, with permanent, meaningful output. When I graduated 30 yrs ago, the idea that your thesis would be interesting was uppity. You couldn't /do/ anything... this is maybe the generalization of the Wikipedia phenomenon within academia. So not just academics would use this; this is where you would get source materials to interpret that Japanese movie...
services. tools. what do we have now to connect people and collections? what do you want to do with them books?
- an automatic timeline and map. find dates and place names, scriptedly or by hand.
- language processing. language conversion.
The GALE processing engine: a published, available tool; funding pushes this towards use in security. It can transcribe text, take text and translate it, and produce English or distill into summaries. "could be deployed right away to this [1M book] corpus"
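The timeline-and-map service above turns on pulling dates and place names out of raw text, scriptedly. A minimal sketch of that step, assuming only regex matching and a toy gazetteer (all names and data here are illustrative, not part of any project's actual pipeline):

```python
import re

TEXT = "Pericles spoke in Athens in 431 BC; the oration was reprinted in Boston in 1852."

# Hypothetical toy gazetteer; a real service would use a full place-name authority file.
GAZETTEER = {"Athens", "Boston", "Kandahar", "Rabat"}

def extract_dates(text):
    """Find 'NNN BC' style dates and four-digit modern years."""
    return re.findall(r"\b\d{1,4}\s*BC\b|\b1[0-9]{3}\b", text)

def extract_places(text):
    """Match capitalized tokens against the gazetteer."""
    return [w for w in re.findall(r"\b[A-Z][a-z]+\b", text) if w in GAZETTEER]

print(extract_dates(TEXT))   # dates found in order of appearance
print(extract_places(TEXT))  # place names known to the gazetteer
```

Real named-entity work (as Greg notes later) needs the context of every word; this only shows why the scripted version is easy to start and hard to finish.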
closure : from Pericles' funeral oration - who can best provide an environment in which people can be most fully human... flexible to change with the times - I can't think of anything more important than that.
quality - how do we know all references are correct? Perhaps the 15th-c. document [from a WP article] is a theory but not more than that. How do we know that this is the best/most accurate consensus that exists today?
a: that's why you need something like the Demos publication. He found that all works on Athenian democracy cited secondary sources, and all of those also cited secondary sources. It was impossible to actually see, from written scholarship, what foundation there was for a tendentious statement. So he went aggressively back to the sources, linking propositions to the source. That is one of the most exciting things about open access and open collections. [every statement can be credited straight back to sources]
q: David M - problematizing things - there are other ideas that will be the result of automatic processes; most fields, even in science, have not struggled with dealing with this: how do you review it for veracity, or correct it and use it as indirect evidence? There is suggestive correlation but not direct evidence. And you are then relying on aggregates of more data than you will ever read?
a: one of the things we need to be able to do is 'read Wikipedia' critically. Of course it's no different than reading Time magazine.
a: (Mike) - reputation is important; who says what and the legacy of their contribution becomes as important as their skills and the data they have access to. knowledge system without reputation doesn't really support the goals we are after.
a: that is a shorthand, a way of saving time; you don't look at everything everyone says. For me it is more being able to look at the evidence. Looking up the footnotes in Widener... I want people to be able to do what I did as a first-year: pulling books off the shelf, disagreeing with the expert; [only possible with access to source]
q: the challenge is how to build reliable material without basing it on reputation systems? What is the right amount of structure to produce something that can challenge an academic system? We were wrong too: we thought we had letters from Plato, but there were forgeries among them. Pages that had to be erased. You had to back off from the philosopher-king system, and from the Athenian democracy system, which was considered a complete failure for most of Western history; the American system was not based on it because it was so unstructured -- it was considered unable to produce intelligent decisions.
((what is the direct practical meaning of Demos on WP organization? --ed.))
David Mimno & David Smith - Technology
DavidS, Language technologies in the wild. [NLTP and more...]
I am a prof now so I don't actually do - well, I try to do things. But a lot of this is stuff other people do. David M, for instance. Computers were designed to put things into boxes: databases, xml... [ancient illustrated canon tables from Byzantium - not new :)]
The world also has unstructured data around visual representations; in some cases it is Latin with lots of abbreviations and references. Otherwise not much difference b/t this and modern tech papers...
Language work so far. Email : spam processing. OCR identification. morphology and syntax
monolingual and multilingual dictionaries, cross-lingual IR [information retrieval], machine translation.
multilingual exploratory data analysis. clustering/classification/model building
'byl jasny studeny dubnovy den a hodiny odbijely trinactou'
it was a bright cold day in April and the clocks were striking thirteen...
from the CIA translations of 1984
UMass DL seminar projects:
- OCR error correction
- book-specific language models; unsupervised font models
- time-sensitive language topic models
- name disambiguation linked to Wikipedia
- better quotation detection; efficient n^2 doc similarity... [actually m^2 for # of clusters m]
OCR: error rates so far are 8-24%. Errors cluster; meaning-bearing words are more likely to be wrong than connectors.
NTS: 1) new font model from near neighbors.
Our project - we have 600k books. IA didn't send us a disk, they said 'hey, we have a website'. It took us a month, 1 person flooding our I2 connection. 80% of them are 1900-1923, with 10-15k per year; then ~500k per year in recent decades from gov docs. [We downloaded the OCRed text; the images are still being downloaded, and that's much larger.]
q: are you providing a disk to others?
a: we'd like to think we don't have the bandwidth to satisfy all comers... but perhaps we should try
process: we look at existing OCR, then go back to the images online to help correct it; which seems to be new.
update on stat machine trans. 200m words, then parliament, then religious and econ. Not much on what we really care about. Lots of Chinese, Japanese, Russian, German, English? I think...
Finally, once we have parallel texts, lots of monolingual tools : named entities, other identification, can be overlaid to connect words to words without having great direct translations. annotated data from sources such as WP with ref strings can be projected over other languages.
"there's no data like more data"
exploration <-> exploitation
research becomes infra. new data drive research. new models.
Someone (Steve??) mentions that the Sloan Digital Sky Survey noticed the largest structure ever found, over 1 billion light years across, which required so many data points it had been missed before. Mike asks for a cite.
(( announced by Princeton's Gott and Jurić in 2003, upstaging the CfA : http://en.wikipedia.org/wiki/Sloan_Great_Wall ))
Thanks to the organizers! Introducing myself - from U Mass Amherst?? Working with [the fabulous] Hanna Wallach and Andrew McC.
we are working in the tech tradition but also the classic tradition of understanding cultures. (Aside: if you look at the U Mass campus, classics and cs are at the opposite ends! I'm trying to understand what is in the middle. currently it's a parking garage.)
Go into Tisch: they have nicely set it up so you only see a small number of books, in your aisle. If you laid them all out it would be really impressive. What tech methods can we bring to books written by thousands of people over hundreds of years, to make them accessible in a way that others can draw conclusions from, use, and understand what is in them?
As a computer scientist, my first reaction is: we have a lot of things that are similar to each other, so try clustering. In news, for instance, there are articles about baseball games, about negotiations with players and unions, and about negotiations with workers and unions.
try to take a low-dimensional space and model the high-dimensional space in it. Cluster topics instead of documents. Docs will be modeled as mixtures of word groups. For instance : 3 from sports; 1 from sports and 2 from labor; 1 from business and 2 from labor (for 3 different docs). What can you do with this?
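The "documents as mixtures of word groups" idea can be sketched generatively. This toy (stdlib-only, with made-up topics and probabilities) draws each word of a document by first picking a topic from the document's mixture and then a word from that topic, which is the intuition behind the topic models described here:

```python
import random
from collections import Counter

# Two toy "topics" as word distributions. Vocabulary and weights are invented
# for illustration; real models learn thousands of these from the corpus.
topics = {
    "sports": {"game": 0.4, "team": 0.3, "season": 0.3},
    "labor":  {"union": 0.5, "contract": 0.3, "strike": 0.2},
}

def sample_doc(mixture, n_words, rng):
    """Generate a document as a mixture of topics:
    per word, pick a topic from the mixture, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(mixture), weights=list(mixture.values()))[0]
        dist = topics[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words

rng = random.Random(0)
# A doc that is "1 part sports and 2 parts labor", as in the example above.
doc = sample_doc({"sports": 1 / 3, "labor": 2 / 3}, 30, rng)
print(Counter(doc).most_common(3))
```

Fitting runs this in reverse: given only the documents, infer the topics and each document's mixture, which is what produces the "virtual shelves" below.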
- when we had 45,000 books, a project algorithmically created virtual shelves from the text of the 1913 book "A Short History of the United States" based around word clusters. Call this an 'unsupervised method'.
- Multilingual text mining: scholarship and historical work in many languages is relevant. can we alter existing topic models to align across languages?
-- cluster words from documents that are translations of one another. In EU direct-translation corpus this works well. In Wikipedia with non-direct translations but topic-similar articles, it also works pretty well and (bonus!) you can choose more interesting topics...
[Square-area graphs showing the proportion of a Wikipedia about a particular topic]
You can see how comparatively popular a topic is across the entire corpus.
skiing: Finnish. Television: Finnish, less obviously. Ottoman/Byzantine empires: Greek/Turkish/Persian.
Took 20 JSTOR journals about classics, from 1850 to the mid-2000s. How did scholars perceive different texts? Look at the Aeneid and Thucydides's History. The former has cyclic publication; the latter is low in 1910 and high in 1950. In the latter's discussion after the rise of fascism, all previous general discussions go away: power, support, battle, empire, imperialism, alliances.
[finer granularity : what are people reading?]
Citation patterns are complicated. canonical works, books, chapters, lines. explicit refs are hard to extract. There are lots of variations, but some are visible.
Most cited passages look very different from 1910 to 1950.
In 1910: a seemingly random selection from chapters 3, 1, 6, 2, 5. In 1950: mainly chapter 2.13-15, with other parts of chapters 2 and 1.
Other text mining for critical points in history : look at State of the Union addresses.
A really global view: you can see the change year to year over all clusters of words applied to the speech text, to see how variable speeches are from one to another.
Q [josh]: what are the low hanging fruits, things to generalize from? I'm interested in on the ground populist public libraries. How to think about [what libraries can offer] in this context?
Mathias Schindler – Wikipedia and community creation
A quick overview of Wikipedia and other Wikimedia work:
- download.wikipedia.org lets you download everything. This is important. 4 TB of uncompressed text.
there is a statement that anyone who can view any WP article should be allowed to edit the article. 2001++
- there are 21 local chapters. I have been a full time employee of Wikimedia Deutschland since the start of the year.
- We are nearing 3m articles in the English Wikipedia; less in other languages, though you can rearrange more into fewer ((and vice-versa)) based on how you divide articles into sections or subarticles.
The audience is broad - most internet users anywhere in the world know at least the term Wikipedia.
: we printed 20k German editions of a concise Wikipedia. The print run basically sold out; one of the most popular concise encyclopedias this year. There were also different editions e.g., for Bertelsmann book club, beyond the 20k. A nice demo of the goal of providing info to anyone [in various formats].
[Library integration of data]
We have an architecture for auto-linking ISBNs to authority files at national libraries. The German Wikipedia has a page about this [Mike asked for details] but maybe it needs translation into en.
(( see for instance http://de.wikipedia.org/wiki/Wikipedia:Normdaten and http://de.wikipedia.org/wiki/Hilfe:PND ))
Personennamendatei (PND) - how to build a system to match two kinds of records? See the Wikipedia-Artikel/PND project. Next: how do you encourage volunteers to do the matching quickly? We did this [in Germany] by making media from the libraries available under a free license... something that was valuable to the community.
Library cooperation involved:
- read/write access to authority files. (( an audible gasp ))
- access to bibliographic records.
- improvements in how we reference literature [in Wikipedia]
- FRBRization of literature references
- augmenting records to include access to the actual work, and whether it is already digitized anywhere.
Aside: note that Open Library is scraping lots of databases, which may be a problem in countries with database-copyright laws. That data remains to be evaluated.
(( we may need to improve templates for referencing certain books, canonical literature, &c. Track edition type and other things beyond ISBN - Open Library id's? --ed. ))
Q [Greg]: Have you worked with academics? There is lots of material that spans many languages. Are there academic/public partnerships? We have material that is often not aggregated; for histories I would talk to Monica for instance. This material could be added to wikisource &c.
A: Yes. There have been 'Wikipedia academy' days. We bring in profs and discuss with them how to add new material and keep it up to date. Organizing these events takes as much time as talking directly to people. We have done half a dozen of these in Europe... mostly in Sweden, Germany, Israel. There will be one in the US soon, but we rely on bridge initiatives [including those by others].
((There's a wiki conference in New York July 25-26 : http://en.wikipedia.org/wiki/Wikipedia:Meetup/NYC/Wiki-Conference_2009 –Ed.))
Q [Greg]: The separation of WP from what people in universities do is something I would like to see most addressed. I am tired of hearing people say you cannot get credit for this (contribution to WP). We have to deal with this! The job of academia is to contribute constructively to the public sphere, of which this is a dynamic part. How do we get this going? The classicists have lots of tenured folks who can do anything they want. They can write letters for each other to solve the problem of critical mass of tenured faculty.
You have Germans, Italians, Croatians, Egyptians -- take a topic like the history of Greek science, which is a really hard one - and a lot is [only] in Arabic. Egyptian scholars might work in that environment. How do we get the Arabic, German, English speakers to create something that goes across these Wikipedias, contributes to wikisource, Wiktionary, WP, and has the accountable contributions of scholarly work to which you can point?
We have a flashpoint here, just as one example. Maybe just for me, in the backwater[sic] of academia where no one cares what we do... if today we move somehow towards this circuit closure, that would be historic.
((most inspiring rant all session --ed.))
Brian : 2 questions:
1 - how do you build the enabling [technical? interface?] platform, which I don't think exists?
2 - how do you motivate academics and researchers to contribute? Hard to do - it might work better here than in the UK, where money you get is tied to indexed citation publishing you do.
Greg: Forget the UK, do what works... find people who will do the work. I have been doing this for 30 years. One year people say: this is stupid, I won't do it, why are you insulting me by asking? The next year they are saying: why didn't you let me join? They have forgotten what they said the year before. [and are just as indignant]
Q: one guy working in cuneiform was doing this - they used no platform, just figured out how to enrich Wikipedia and tie it to their database. The will is there.
Greg: we can define at least in one field, 'this is how you would get your promotion'. Counted just as traditional scholarly work. This is [in] philology, what we're used to getting tenure for : we could go from saying "do this out of philanthropy" to saying "this goes in your yearly report, counts towards your raise, etc." How do you get this into the incentive structure? We could do it. I hate to push classics, my field, but we at least are set [to do this right away]... the Italians might be inclined, they are more flexible than the British; Americans could; the Egyptians would. You have a real opportunity to open up a new paradigm and set of incentives for Wikipedia contribution.
Q [sl??] : one incentive might be to take a class and have them rewrite the entry, then pick the best one -- WP can choose the best updates from all the classrooms and incorporate them. Something along those lines, where people can improve things and get recognition for [contributing to] the best ones.
Greg : undergrad research is huge here. The effect on how people think and feel is so dramatic. I am teaching a course on [ ?? ], a North African person who fought the Romans. We should contribute to the WP entry for him... and we should contribute to Wikisource with the materials we have. Maybe also our [linguistic] treebanks. We have all of this metadata.
(( we could also improve effective means of presentation of new outside contributions. ))
Mathias - it is currently easy to help without joining and using the "WP platform". just making data accessible is already helping. digitizing work and providing material under free licenses is helpful. Taking a favorite topic, looking it up on WP, and writing an article and putting it on talk pages...
Direct contribution [to articles] is not the only solution.
Greg : so we have undergrad theses. Then you have sky surveys, once you have a few things like this showing up... colleagues are doing editing at U of Houston and U of Missouri at Kansas City, which aren't typical classics places - the students really respond when doing real [meaningful, lasting] work.
There is an opportunity to solve and address the problem of what you do with [new] material that is there.
Q (Brian) - We are basically giving people the ability to create lots of assertions about books. Where does it go on WP? Do you have stats on how talk pages figure in generation of consensus content, and do they have an independent life of their own? Do people go there to look at the discussions?
MS: 99 of 100 requests go to WP articles, the remaining 1% go to talk and WP: pages. they are mainly for editors to improve the article [though that means the set of Talk pages are a top-1000 site in themselves, similar in popularity to Wiktionary --ed].
MS: references to the Internet archive are increasing. But there is no available list of "ten books currently on the net which might be relevant"...
(( slowly switching over the course of lunch from summarization to live transcription --ed. ))
Maura - time for people to emerge in their black Zotero tracksuits...
Greg - we need to create virtual bookshelves. Dan talked about this at Rutgers in the fall.
[ mashups and APIs needed! ]
Steve - UMass aren't the only people who know how to do this. There are lots of amateurs who do this. Thousands. What allows them to do it is that the data they want to play with has a public API. The text archives do not, so they are not... part of this revolution. If you go to programmableweb.com it lists over 4000 mashups, and it's mainly amateurs throwing the data together, mashing it up with this, that, and the other.
The text archives aren't even on the map. The first thing we need to do (though we also want to support research groups dedicated to pushing research forward) is open up that data through an API that makes gazetteer data and word-frequency data accessible. It's not hard to do this and it would be revolutionary.
Maura - we don't control that...
DavidS - an API that gave you books but didn't allow search would be a very low-level one that most people doing mashups will not be able to take advantage of.
Greg - unless you can download the million books and put it somewhere, the API way does not work. You can't do the operations you saw.
Steve - the proof is the MONK project. we're not operating on the fulltext. none of the operations... there are a lot of them that don't require it. And just because the only way to get /everything/ you want requires the full text, does not mean that nothing could be done without.
DavidS - yes, it would be good if the metadata were better.
Dan - the metadata on Flickr sucks...
Josh - who were you talking about when you said that without full downloads you could not succeed? You as an API provider?
Greg - there are certain [services] you can't do well - such as named entity analysis. you need to look at the context of every word to do a good job.
Steve - If you're a text archive you have this data. The next thing you need to do is make it available to others through an API. Make dates and named entities available through an API and it would be revolutionary.
DavidS - the text archives haven't implemented the tech to do these sorts of things. Before they can provide APIs they need to have it.
Brian?? - but that doesn't involve downloading the whole corpus. These services have to run at the archival origin.
Steve- the shift I'm suggesting is towards saying : "offer APIs to users, not just research results".
Greg- the API presupposes you have done the services. At some point someone has to have access to the whole collection.
Steve - the model emerging [here] was: we produce graphs [centrally]. I'm saying, produce the data and let others produce the graphs. There was a time this was a specialist activity. The things people do with Twitter and Flickr: why aren't they doing that with text archives? That would be great!
Frank - As a general principle, not being a super Net geek, without a super agenda, it may not solve one [specific] problem, but is absolutely a formula for long-term success. Build the next simplest tool.
Steve - things I was doing 15 years ago when there were only 15 people doing them... now everyone is doing them. Back then there were no public APIs anywhere. you would broker relations -- very research/professorial -- with one content provider, work in a lab, generate a graph and show it (in publication).
Josh - my question is, there's a whole spectrum of what level API you [provide]. You could provide a URL that gives a batch dump of data. where do you / how much preprocessing of data do you put behind the API wall? this is a use question.
Steve - this is a solvable problem, not [as vague as] saying "how do we surface all our books".
[Use cases defined]
Greg - let's be clear about what we're /not/ worried about. Google Books is a gateway and a barrier. Here the context is: how much do you want to deal with? There's nothing to stop you from downloading the millions of books, though it's a pain in the neck. There should be gradations of API: search, 'get me other stuff', all the way up to getting everything.
The intermediary here is - supporting subsets. For instance re-OCRing Latin and Greek [scripts]. I need all of those [original] pages back.
Brian - an example of a service we could achieve with the help of an expert: a run to improve OCR.
Greg - the service is 'get me all the pages' [in a language].
Brian - but you have to do this once or twice and its done...
Mike - this leads into: there is a lot of specialized problem-solving that has to happen that is disappearing into your environments. How can researchers who are interested in doing bulk processing and discovery on this [get it]? How does what you find out get contributed back so that the next guy can build on top of it... Without redoing what has been done?
Greg - integrating automated processes and community-driven processes, which aren't automatically brought together. there are ways of doing this in the wiki community which we will explore. but...
Brian - so someone downloads the OCR and improves the transcription locally. how do you get it back there?
Greg - why not put it all on Wikisource, so you can always grab it, anyone can go edit any one page? Have a single standard.. like Distributed Proofreaders. integrate that into a wiki environment.
Peter - one intervention here from the [Internet] Archive's perspective: one thing we think about idly as we get closer to distributing ebooks on portable devices is taking a page from the Google mobile API. Rather than tapping on the page with an OCR error and getting an image, you get an input form to correct it and ship it back. You end up with all the classic problems of not trusting that; run it through a few users. The pain of incorporating OCR corrections into core text isn't trivial, the curve is uneven... it is a wide-open and interesting question how many people will bother. Secondarily, that relates to how much penetration the Archive would have in making books available through channels, compared to other sources of public-domain texts. A lot of user-generated input slowly improving data channels, as a subset of [other] approaches, is appealing, but high overhead.
SJ - that may be high overhead [as conceived] because that model doesn't allow people to talk to one another and solve the problem on their own. There is no canonical place to go edit and update the OCR about a book. Rather than encouraging people to send their feedback to a central private group who filters it and decides what snippets are good, let them fix the whole OCR for a text somewhere, publicly; maybe have subject experts review the result every so often, but don't make the review private or centralized at all. [it doesn't need to be sterile and non collaborative like recaptcha.]
[Wikimedia Tool server as a host for interesting projects]
MS - I invite you to go to toolserver.org at Wikimedia Germany... it [processes] lots of Wikimedia data. It could be a staging server for applications now scattered around the web, so that they could be improved by other people. You still need the place to collaborate and get feedback, but it's more than just an API: you can develop and run scripts, and you have infrastructure to build services and provide them to the general public. Several of the current tools we use at WP come from this area.
Greg- Infrastructure is a big deal. I was skeptical of humanities tools; I was astounded?? by Dan Cohen when Zotero built something that actually worked and people cared about...
Steve - this idea about "how does it get back" - only a slice of the activity needs to get back. Last semester our students built this thing where you give the system a poem and it will pick a picture to illustrate the poem. what's fascinating is that you don't need much - what you would like is an API where you could say "tell me how many words are in this poem, how many times a particular word [shows up]" using some likelihood data. get some numbers out of the API and mash that up with Flickr - that is it. Now there is a new tool in the world, no data that needs to go back anywhere, lots of mashups have known algorithms and could be made totally transparent. There aren't results of the mash up that need to go back. There is a class of things where you want to facilitate people doing cool things with your data. WP is so cool we want to make it part of the experience and [tool set] that we have.
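The poem mashup Steve describes needs only two numbers from the hypothetical API: how many words are in the poem, and how often a particular word shows up. A minimal sketch of that service's core (function name and tokenization rule are assumptions, not any archive's actual API):

```python
import re
from collections import Counter

def word_stats(text):
    """Return total word count and per-word frequencies:
    the two numbers the poem-to-picture mashup asks for."""
    # Naive tokenization: lowercase letter runs (apostrophes allowed).
    words = re.findall(r"[a-z']+", text.lower())
    return {"total": len(words), "freq": Counter(words)}

poem = "The clocks were striking thirteen, the clocks were striking."
stats = word_stats(poem)
print(stats["total"], stats["freq"]["clocks"])
```

The point of the example is how little is needed: mash these counts up with Flickr search and you have a new tool, with no results that need to flow back to the archive.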
Mike - I also mean giving back [energy] so that the utility of this collection [of information/services] gets larger
Steve - in some way, as people build better and better mousetraps, everyone gets to share in that. Twitter tools have gotten better every week, and they are getting very serious [and technically interesting, whether or not you think Twitter analysis is itself interesting].
Mike - The Flickr community lets 3rd party communities expose their extra functionality : it is a discovery mechanism for services. It is a way for other devs [to share their tools/work].
Steve - it is the developers themselves that police that [related mailing] list.
[ Archival issues ]
Mike - am I wrong that Brewster was chagrined that the service layer on top of the Archive hasn't become more robust?
Peter - I can't speak for him, but there is a general sense of missed opportunity. we don't have some rudimentary APIs that help people produce services.
Steve - what is required for this? do we need people to do it for us, fund us?
A dozen use casesEdit
- 1. Discovery of material. Search.
- 1a. Solve the demand for discovery, not just supply of new visualization
- 1b. Solve metadata-linking to leverage existing refs and links between materials
- 2. Global IDs for shared metadata not separate walled gardens
- 3. Geodata grouping and visualization
- 4. As part of the other use cases : customization
- 5. Describe a development environment for all of this; use case: developers need to talk to one another
- 6. technical needs for storing and accessing data
- 7. Natural language processing
- 8. Topic clustering over subsets of books as a service. visualization techniques; gapminder.
- 9. Annotating references; global IDs for individual refs and links between documents
- 10. Classifications of clusters. hierarchy? other?
- 11. Scripted summarization of large datasets with machine-parsable results. Let readers analyse hypotheses based on these summaries for themselves
- 11a. Help all people answer common questions by drawing on large bodies of data. Semantic Wikipedia data.
- 12. Building a common platform for querying data. Similar to 1....
Josh - if the corpus is downloadable... throw it all on a server or in the cloud, and start to spec it out -- it seems the single most valuable thing we could contribute is specifying the API and what it should do.
If we can come up with a first stab, that requires no overhead up front. We could do a pilot with a copy of the data, come back, figure out a set of use cases for an API, and the functionality and methods and calls we would want. Try it, do it on a low-capacity kind of platform without opening it to the world... iterate quickly and build out what that service layer could look like.
Then the same process could apply to others as well, not just the Internet Archive. We could articulate some services we need on top to do cool things.
Greg - What are the use cases people want to throw onto the table?
[1. Discovery of material. Search. ]
Josh - discovery. I want API calls to let me throw a search term out. At NYPL some people know exactly what they want. Others have general search terms and want to winnow down - we're not so good with the latter.
Greg - this relates to the winnowing of term clusters. what does that [technology] do for us?
DavidM - do we have a sense of the million books' contents, and of which ones are not usefully OCR'ed? 2/3 are English... what is the language distribution?
< David looks for data on this. Later : 2/3 are English, 16% are French and German. ?/Latin is another 8%... Fewer still are Spanish.>
[1a. Solve the demand for discovery, not just supply of new visualization ]
Dan - It's important to think of the demand side, not just the supply side. For 99% of scholars, getting to the close-reading part from a good discovery experience is more valuable than graphing etc. On computational methods and textual analysis: when I show David's thing about virtual shelves, eyes light up a lot [in the audience]. When I show some of what I consider more interesting [technical] graphs on word counts, less so. You could get into multiple languages here too.
The general open archive?? experience is terrible... David's shown one way to improve. This helps get over the red herring of criticism against these digital methods that talks about "the serendipity of going into [a big library]", which is the most annoying discussion you can have with analog scholars! You can say to them "we can give you something with twelve different types of serendipities... [faceted variations]".
[1b. Solve metadata-linking to leverage existing refs and links between materials ]
The other thing we haven't discussed to get this outline started : in the WP community they have templates and can link data together [with existing library data]. See how we can use old-school references to tie things together with existing archives.
Peter - one thing I professionally want to see is cool services running out of the Open Library. It could serve as a nice switch point for derivative or secondary services that use it as an engine to provide data / a wrapper that can consume other things. ('provide for' in a nice way)
Examples: OPDS, a standard which came out of early Stanza work @ Lexcycle. We are working to keep this alive with O'Reilly and a few others. (this rocks --ed.) I want to throw a few IDs at Open Library and have something there throw back Atom-formatted catalog files that I can put on my site or blog. There are other discovery services we could imagine, including data enhancement.
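(( For readers unfamiliar with OPDS: it is an Atom-based catalog format, so the "throw IDs, get catalog files back" service Peter wants would return XML like the sketch below. This is a minimal illustrative entry, not the full OPDS spec; the IDs and URL are invented. --ed. ))

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

def opds_entry(book_id: str, title: str, epub_url: str) -> str:
    """Build one Atom entry of the kind an OPDS catalog response contains."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = book_id
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    # OPDS marks the downloadable file with an acquisition link relation.
    ET.SubElement(entry, f"{{{ATOM}}}link", {
        "rel": "http://opds-spec.org/acquisition",
        "href": epub_url,
        "type": "application/epub+zip",
    })
    return ET.tostring(entry, encoding="unicode")

xml = opds_entry("urn:olid:OL123M", "Walden", "http://example.org/walden.epub")
```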
Again this is back to SJ's earlier point: how can the community help us do these things? I can speak for APIs at the Archive: George Oates will be very focused on lots of UI goodness and social participation. The two of us will certainly figure out how to rope our priorities together so user input and machine-oriented APIs live in a nice happy growing harmony. But there aren't that many engineers behind us [at the moment]. In the rest of the afternoon, think about cool things you also would like to see, think about how that work can be done elsewhere -- or how part of it could be processed so that Archive folks can take it and work with it easily. We are open to any of that, but we can't do all of it ourselves/internally [we need outside help as well].
[2. Global IDs for shared metadata not separate walled gardens]
Josh - I have been heavily resisting flipping the switch on comments and other things at NYPL because I don't want that to be in our own walled garden. I want that to be enabled via web services, not storing that data locally but patching into a consortial approach. Whether it is Open Library or Wikipedia, if there is a canonical identifier for an instance of a book where I can say someone's looking at that page and send local reviews out... [great.] There are all sorts of issues - accounts, verification and reconciliation (and merging) - but that's the world I would love to see [shared globally].
[more use cases!]
Greg - everyone is here for their tech background but also because they have a sense of the use case. So what are people actually going to do? Before we get into architecture - some APIs will be useful for some and not others... let's spend time, with everyone with a sense of what tech could do, talking about what you would like to see happen: what actual operations you want to enable. Then we will generalize and prioritize.
Steve - I don't think this idea of open APIs is consistent with the statement that discovery is the fundamental analytic [/use case]... if you open the APIs and open the data to the groups that build them, one thing that gets worked on is IR (information retrieval) and discovery and things like that.
[3. Geodata grouping and visualization]
Josh - before we talk more about names and dates, let's talk about space. We have a location - I am on a street corner - what do I send back?
Brian - people are paranoid that essential things will be left out. But a tourist standing somewhere wants to know the four things that are relevant, or very short summaries of fact; for that, WP [or Google Earth] provides a good measure. Most people are not researchers, so that is a good model for that kind of summary delivery.
(DavidS, later - imagine you have a mobile device and you're standing in the middle of the street and want relevant information before you are run over...)
[4. As part of the other use cases : customization]
Greg - say I'm standing in front of the Isaac Royall house, and I see this article. I get really interested. How do you explore everything there is to know about this? Use cases illustrate one thread - I am interested in slavery [and ask about that]. A person has a theme and wants stuff from that theme and space. Someone else may be following the rum trade. When you have all this information... [you can customize what is discovered]
MS - as a European tourist, the notion of an 'Underground Railroad' would be misleading. I had to read the article [just now on WP] to know what that term was about. It is more about access to WP in a way that takes into account the information I have culturally.
Greg - customization is a huge theme dear to our heart. [I'll pay you $50 later for bringing the topic up so neatly] Coming up with a model for your background knowledge... can we discover what is new for you? This matters.
Josh - I would be wary of tackling this in a way that was not strictly lowest-common-denominator. Imagine an iPhone app I open up focused on the use case of where you are. Could be a building or anything else. It uses the corpus of digitized books as a great body to start with a place, discerns clusters of topics, and gives an interface that [lets you choose] - provides an easy intro to the topics from WP, then offers a reading list. [others - and should let you choose how to refine by cluster, don't just choose for you but give options]
Brian - customization is a big topic in app development. You have a little landscape and have to show the user the 10 lines that are relevant, or let them [step forward through the process]. Even a WP article would get them run over before they are halfway through it. You have to develop a mechanism that is customized also to the situation... For mobile use and wandering around, this may mean "playback": imagine their history of wandering is recorded so that when they are back they can replay it. A similar significant aspect is preplay - the system knows you are somewhere, say at the Olympics. You come from a particular country, and the nexus of services suggests an itinerary to you [and preloads it].
Greg - you need a crunching machine to provide an API. Choose Lowell, Mass. - not the Lowell Institute or other uses of the term. Download WP and books, use WP as a training set, and feed back all the Lowells; then sort according to a cluster or some criteria (and let you choose the cluster you mean).
This is for when you are swamped with data.
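(( Greg's "Lowell" idea - use Wikipedia articles as labeled training text, then assign an OCR snippet to the closest sense - can be sketched as nearest-centroid classification over word counts. The two toy sense profiles below are invented for illustration; a real system would build them from full article text. --ed. ))

```python
from collections import Counter
from math import sqrt

def vec(text: str) -> Counter:
    """Bag-of-words vector for a snippet of text."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy 'training set': one profile per Wikipedia sense of "Lowell".
SENSES = {
    "Lowell, Massachusetts": vec("mill city merrimack river textile massachusetts"),
    "Lowell Institute":      vec("lecture institute boston endowment education"),
}

def disambiguate(snippet: str) -> str:
    """Pick the sense whose profile is most similar to the snippet."""
    v = vec(snippet)
    return max(SENSES, key=lambda s: cosine(v, SENSES[s]))
```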
Brian - you also want to reduce this by information in your profile. ((make 'filter by profile' an option, not automatic --ed.))
[5. Describe a development environment for all of this; use case: developers need to talk to one another]
Mike - say we decide what the prioritized tasks are. Do we have a development environment we can drop this into that people would understand? My guess is no, so shouldn't we discuss a dev environment as a first order of business?
Greg - One thing I've talked to Brewster about is taking all of these books and putting/gathering them somewhere we can operate on them.
Brian - we haven't talked about this yet. GPUs...
DavidS - we are I/O bound, not processor bound ... [no places are offering platter, just CPU cycles]
[6. technical needs for storing and accessing data]
DavidS - OCR in Texas from all Archive books is 1.5 TB zipped. Images are a petabyte; those download more slowly (and as needed?).
Greg - so you could ask for 10k page images out of slow storage and get them from the source easily.
Steve - say you take WP and use it as a training set to deal with the Archive. you will need major hardware.
Greg - it is mainly storage for the source. Then the output of each bit of research can be saved / served. So now we have an out-of-date service, maybe 100?? days behind? But every so often you go through and revise [update books] / disambiguate names in the public API's data.
SJ - could others provide physical disks of the OCR text (the 1.5 TB) as a service?
Peter - as long as people are contributing back to services based on our corpus, yes...we want to provide access to biblio data, subsets of data, &c.
[7. Natural language processing ]
Peter - we have been talking about NLP. To create a different kind of discovery layer driven off of interesting phrases, etc. - you could reel off the attributes that might develop. This is one thing the Archive would be immediately interested in helping out with in whatever way possible. Even if no-one helps, I might put together a couple of meetings to do that with the Archive corpus this summer. There are lots of software libraries for this in Python...
< PB departs >
[ 8. Topic clustering over subsets of books as a service. visualization techniques; gapminder. ]
Dan - how on-demand can clustering be?
DavidM - <was away, is back!> There is an intensive clustering process that assigns every word to a topic. then there is a characterization of the topic, perhaps the list of top 10 words, and a distribution over topics for each book. For 1M books times 30k topics, of which 50 might be non-zero for any given book... so 50m database records.
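(( DavidM's storage arithmetic - a sparse topic distribution per book, kept as (book, topic, weight) rows - can be sketched directly. The weights and threshold below are toy values; the point is that ~50 non-zero topics per book times a million books gives ~50M rows. --ed. ))

```python
def sparse_rows(book_id, topic_weights, threshold=0.01):
    """Keep only (book, topic, weight) rows for topics above the threshold."""
    return [(book_id, t, w) for t, w in enumerate(topic_weights) if w > threshold]

# One book, 8 topics, most weights near zero:
dist = [0.0, 0.62, 0.003, 0.30, 0.0, 0.05, 0.0, 0.027]
rows = sparse_rows("olid:OL1M", dist)
# Only 4 of 8 topics survive here; at ~50 surviving topics per book,
# a million books yields on the order of 50 million database rows.
```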
Allison?? - What if someone had a thousand books they wanted to topic model? could you start there?
(( can you compute clusters over the whole corpus once a year, and in the interim simply incrementally map new books on to available topics? --ed.))
Greg - I want to see the advent of oral poetry; how does text related to that [topic] change? then there is the secularization hypothesis - can we revisit that now with these 19c books? sure.
Steve - there are exploratory models that let people pull out interesting [spikes, changes from a large set of data] and say look, isn't this odd?
Greg - let people ask for combinations of topics, like gapminder - visualizing a very large dataset on demand.
Maura - What about quote finder; a reference finder?
DavidS - like our quotation finder, yes. you see someone quoting lots of Kipling all together [and could catch a pattern]
Maura - Google throws it back to you, showing a work is quoted N times from 1910-1950.
Greg - when I looked at Google's reference data it was wowie-zowie engineer stuff but you couldn't get access to the data to see what is behind it. They don't know what this group [would want to] talk about - what would you actually need to do meaningful work?
[9. Annotating references; global IDs for individual refs and links between documents]
SJ - A friend and I wanted recently to annotate a body of research to pull out every reference and indicate sentiment -- perhaps building on top of automated sentiment assessment, but showing the current discussions about how something is cited (and whether it is a good or bad source for a given topic).
How can we associate data or comments to a specific annotation?
DavidS - there's good work on sentiment analysis for references based on NLP.
SJ – sure, you can include automatic analysis as well; and you need to be able to update it by hand (( and debate how to understand a specific ref, and whether that means the author thinks the source is reliable or not ---> adding to the global 'reliability' of the source in that context.))
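(( The automatic first pass SJ and DavidS discuss could start as simply as a lexicon score over the sentence surrounding a citation, with editors overriding the result by hand. The cue-word lists and scoring below are invented for illustration, not a real sentiment system. --ed. ))

```python
# Toy cue-word lexicons; a real system would learn these from data.
POSITIVE = {"seminal", "definitive", "careful", "convincing", "authoritative"}
NEGATIVE = {"discredited", "flawed", "disputed", "outdated", "unreliable"}

def citation_sentiment(context: str) -> int:
    """Score the text around a reference: positive cues minus negative cues.
    Zero means no opinionated cue words were found."""
    words = {w.strip(".,;").lower() for w in context.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)
```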
[10. Classifications of clusters. hierarchy? other?]
DavidM - we need to get down somehow from 1M books to 30k or so topics. As we build up... yes, we have the LOC classifications, but we would like ones based on the topic clusters we have generated.
SJ - are you thinking of naming clusters?
Allison?? - There was a project relying on cluster namings, but it was very hard to agree on how or when to name.
[ed - this includes a namespace issue]
Greg - there are a set of death bed-scenes. The more of this info you can get together, the better [for analysis]. then in the world of Thucydides there's what he thinks it is, and what other historians think at the time. they do this [analysis] by marking up the Thucydides text. In terms of secondary scholarship, they need this [data].
[aside: 'what method do you use?' look up "latent Dirichlet allocation"]
(( http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation ))
Greg - can we work with 10, 100, 1000 topics and then see what this means?
[11. Scripted summarization of large datasets with machine-parsable results. Let readers analyse hypotheses based on these summaries for themselves]
Greg - We take it for granted the collection is too big to read.
SJ - what is too big to read?
Greg - the million books.
SJ - I wouldn't say that. Someone will read each book. Let's say within 10 yrs someone who cares about this project reads half of the books...
Greg - I'd like to know that statistic!
(( Aside -- Wikipedia is /not/ written by robots and scripts and algorithms. It's written by people, and not very many of them at that -- half of it newly written in the past year. That's over a million articles; we can read a million books if we define the problem correctly and put our minds to it (and approximate notability so no-one's wasting their time on close-reading of the 50% that can be classed as redundant or superseded until the rest are mostly done) --ed. ))
Steve - with data analysis... say with Thucydides, you discover things which [academics] respond to with "of course."
Greg - But that's why Thucydides is interesting, i wrote two books on it but did not know this.
Steve - but you pretty quickly have an idea about why it happened.
Steve?? - when we found "bumblebees" might be an erotic marker in Dickinson that was weird, it didn't make sense.
(( from the MONK project. See : http://www.unl.edu/scarlet/archive/2008/04/24/story1.html ))
Brian - how do you indicate such a hypothesis so that someone else who comes along can use it in a machine [parsable] way?
DavidM?? - this isn't the same as an exploratory model.
Brian - right, start in a universe where everyone's assertions are assumed to be false. Take a model where you could somehow take this explanation and attach it to someone else's.
(( so we need a namespace for individual snapshots or pieces of data so many groups can refer to the same thing in their analysis. Who hosts the namespace / provides a UID mechanism and canonical URL for each entry? --ed.))
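(( One way to mint the UIDs the note above asks for is to derive them from the content itself, so any group citing the same snapshot computes the same ID without a central registrar. The URN prefix and URL scheme here are invented placeholders; who hosts the canonical URL remains the open question. --ed. ))

```python
import hashlib

def snapshot_uid(payload: bytes) -> str:
    """Content-addressed UID: identical data always yields the same ID."""
    return "urn:snap:" + hashlib.sha256(payload).hexdigest()[:16]

def canonical_url(uid: str) -> str:
    """Hypothetical canonical URL for a snapshot, hosted by some namespace provider."""
    return "http://example.org/snapshots/" + uid.rsplit(":", 1)[-1]
```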
[11a. Help all people answer common questions by drawing on large bodies of data. Semantic Wikipedia data. ]
Mike - I heard the seed of a grant proposal there - "Our goal is to make it easier for scholars [Greg - people! not scholars] OK, people, better! - to answer questions drawing on large bodies of data." That's it. That's the whole thing.
MS - if I may draw on that... Semantic MediaWiki is [getting better], which allows us to qualify relations between articles. I want this to let me quantify or confirm data related to a particular piece of information. One way is to draw on existing text with machine learning. Another involves more human effort. [Completing / speeding Semantic MediaWiki development] is a feature I would imagine being done that would be of great use to WP and WP readers. And that would make this information useful to anyone, basically.
Greg - so David, you alluded to sentiment analysis, something more detailed developed in Cambridge. Is that related here?
DavidS - you could look at the problem as information or relation extraction, the easy cases. "Is X Y?" The problem is there are all sorts of complexities in speech and language. I don't know what sort of performance you would expect for this to be useful for WP.
MS - if there were 10 candidate matches (for data / cites related to an article) and the actual ones were just 10-15% accurate, it would work for someone working on Wikipedia text.
Greg - Here is an example of a real-world problem with a big audience: the analysis of the battle of Gettysburg... the interpretation of which varies widely and changes in the 19th century. I've looked at and digitized a lot of stuff. Being able to get a rough analysis - what are reasonable chunks within which people agree? Reduce the problem to a tractable one. For instance, who lost a given battle? One general, who joins the Union Republican party after the war, is then known as the villain who lost Gettysburg...? It's been a while since I've read this.
(( General James Longstreet. WP addresses his changing legacy over time. --ed. ))
Greg - You have overwhelming data here, large audiences who would like to know the answer, and with data this becomes tractable. See also what people say about Lincoln over time.
DavidS?? - we should be wary about treating all documents as making factual statements.
(( Agreed! More data doesn't make everything tractable, but helps elevate debate. -ed. ))
[12. Building a common platform for querying data. Similar to 1.... ]
Mike - what is the common platform you need to go in either direction when querying data? Is that what we are here to talk about building, or not?
DavidS - possibly. There are some core language technologies that are useful for all of these tasks. Morphology, syntax analysis, discourse analysis. then there's Wikipedians marking up data with the additional information they would like to see. Or a set of links taking you to other examples / bibliographic references. But are you putting together really low level tools? or...
Mike - for you as a practitioner, when you solve a new problem, do you have a toolkit, a library of code? do you twitter your community to find out where to get a head start on writing the code?
DavidS - or even write to the right mailing list... Generally we look at published high-level directions?? of research, and think about models people have used. Then there are collections of open-source software that solve some pieces, but for most new problems you have to put these together yourself. There's no higher-level architecture that integrates these.
<Mike ponders this revelation>
Time check, break and reflectionEdit
Greg - we are at 3:15. Let's have a break and come back, go around the room and see what people would like to have happen. Here are my questions to you.
- Hypothesize you have 1.5M books no-one can do much with yet. Something more interesting could be done. What is it you do? Is that a bad [place to start]? What should we do? What can we do in the next year? What would make a difference? Do we discover things we've never seen before? do we help people organize books they've never seen before in a useful way?
Closing ideas, next stepsEdit
Working around the room, each person is asked to share their final thoughts.
[Dan on central services]
Dan - it seems to me OKC should be focused on services at the center of things. Services on WP that could then be put on zotero.org. Say okc-suggest: DavidM's preprocessed topic clustering. You can right-click on an item and Zotero will give you the books in that topic. It can start on Wikipedia, as a line there.
We've talked about basically expanding the lower half of the page [for some time]. The sources section was not so good - Wikipedians are pretty lazy, like the rest of us - and there was a push (( by whom?? --ed. )) to get print resources on there. We talked to some Wikipedians about that. In the same way, a recommendation system or suggestion thing can go into the sources section.
There is tech Wikipedians use to find articles, enhance them. These can be services that will aid in that way.
[Dan on URIs, WP integration, and an OL wiki with data for each book]
Also, a conversation is evolving around simpler things - getting good URIs for author names; ISBNs. Any IDs put into WP become machine-readable and discoverable. An article that has UIDs in it, in a template or plain text, becomes useful for secondary services that might be useful to others. On the low-hanging-fruit side, it would be great to get to some computational methods, but I think having embeddable stuff now will get Maura more towards where she needs to be. I think we should work on things that will expand source recommendation and referencing in WP, and get unique IDs in there and linked - so that human computation and eyeballs are effective.
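(( Dan's "IDs in WP become machine-readable" point in miniature: pull ISBNs out of wikitext and normalize ISBN-10s to ISBN-13, so every downstream service refers to the same identifier. The regex is deliberately simple and would miss some template variants; the conversion follows the standard ISBN-13 check-digit rule. --ed. ))

```python
import re

# Matches "ISBN 0-306-40615-2", "ISBN: 0306406152", "isbn=..." after upcasing, etc.
ISBN_RE = re.compile(r"ISBN[ =:]*([0-9Xx-]{10,17})")

def isbn10_to_13(isbn10: str) -> str:
    """Prefix 978, drop the old check digit, and recompute the ISBN-13 check."""
    digits = "978" + isbn10.replace("-", "")[:9]
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return digits + str((10 - total % 10) % 10)

def extract_isbns(wikitext: str):
    """Return the raw ISBN strings found in a chunk of wikitext, hyphens removed."""
    return [m.group(1).replace("-", "") for m in ISBN_RE.finditer(wikitext)]
```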
It was impressive... I think of Flickr Commons for adding metadata to objects; we can move beyond text too. (Josh's and Mike's work [with media] comes in here as well.) This interesting quid pro quo where we give you 100k images and you add metadata to them - the same could go for text. Have it classified or attached to subjects. You need good UIDs, which Open Library may give you. They have effectively a Wikipedia inside Open Library. It would be good to talk to George Oates and find out what they are going to do with that. At Zotero we were talking to them and Aaron Swartz early on about read/write on their wiki, so we could pull metadata in from them.
Let people define services, put them on the [WM] tool server, but also have them out there so the Zotero guys are integrated as a right-click off of a collection item. Pure computation is interesting, but it's powerful what can be done with editors and human computation.
Mike - the book data enriches what is outside, but it works both ways; the rest of the web should also enhance user experience on that book at the open lib. (( Yes! --ed. ))
Dan - at Open Library there are fields lying fallow - tags, any number of things that would come out of usage on WP. That could involve transcription also [fixing OCR], which would enhance browsability as well.
[DavidM on faceted search : virtual-shelf tools for public browsing]
DavidM - I am 4 yrs out?? from providing a service to the public, but I would love to see virtual shelf provided to the public.
Maura - what is the elevator pitch for this?
DavidM - the idea that you go to the library and your book is missing but the ones next to it are fine is no accident. There are decades of research in library science [showing] that you want faceted search, not just one linear facet. We now have the ideas and technology to build, in a data-driven way, 30k facets from a million books.
(( can you convert clusters into a linear sub-dimension for books that are defined within it? could you use that to move from a book that has no value along that dimension to the 'closest' ones that have some value in it? ))
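(( The note above can be sketched concretely: treat one topic's weight as a linear dimension and 'shelve' books along it, dropping books with no weight on that facet (or, per the note, placing them next to the lowest-weighted books that do). The titles and weights are toy data. --ed. ))

```python
def shelf(books: dict, facet_min: float = 0.0):
    """Order books along one topic facet, descending by weight.
    Books at or below the floor fall off this virtual shelf."""
    kept = [(b, w) for b, w in books.items() if w > facet_min]
    return [b for b, _ in sorted(kept, key=lambda x: -x[1])]

# Toy facet weights for, say, a 'natural history' topic:
BOOKS = {"Walden": 0.61, "Moby-Dick": 0.08, "The Prince": 0.0, "On Walking": 0.44}
ordered = shelf(BOOKS)
```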
[Steve on APIs and infrastructure for mash up culture]
Steve - I want to see the different text archives participate in the mash-up movement, which is incredibly exciting. Google Books is not at the top of these mashups because their APIs are half-assed. The open archives have the opportunity not to be half-assed, and if you're a geek you would want to take advantage of these services. You would need to park them on high-performance servers, build a good tech infrastructure... get a good grant and consortium.
I think this addresses directly the issue of getting good metrics. open the gates, make it an exciting thing to hack.
[David B?? on providing context to readers of other things online]
Q [David B??] - Let me jump in out of order, to follow on the last two comments on what to do with these 1M books. Most people using WP don't need this raw text to do computation on. They won't use an API to make a mash-up; they just want to know something about [a topic]. In that sense, it's really providing contextual info about the book they are already looking for that is most helpful.
This came up most for me with regard to the Thucydides timeline. The secondary info about him changed in the '40s to reflect WW2 interest in power. That may be self-evident to a classical scholar [who says "of course"], but it is not to my mom. That's the kind of [valuable] knowledge [average readers can] get from these collections, that might be helpful to the public sphere.
Greg - this is the special forces view: not the air force at 35k feet, not infantry; you see something interesting, and drop onto the ground for full details.
David B??, at the end - I already spoke. But to reiterate what I said before: we're sitting here as uni scholars, but academics already have the resources/structure to do what they need with these collections. The audience to focus on are the general users who just want to know stuff, like my mom.
[SJ on Wikisource for canonical OCR/translation, granular UIDs for stats and references, and a Search tool]
SJ - three things.
* have a place for OCR and translation. Wikisource?
* have UIDs for individual stats. provide a tool for publishing things you find, producing UIDs that can themselves be referenced in articles.
* provide a tool for Wikipedians to search books / do research in a way that automatically links them to a canonical URL for the original work. Currently most Wikipedia research is done via googling the Web at large. Google Books isn't the best tool for that at the moment - you cannot access the raw OCR'ed text, deep-link to named/numbered pages, or add global translations or comments. As a research community we cannot modify that interface.
Visible outcomes on WP should include changing standards of notability to include # of inbound references, popularity on this site, &c; directly using internal links across the million book corpus.
(( see also : provide slices and collections in a downloadable or snail-mailable! way for offline use, to further distribute the computing load to local research groups that want to do their own bulk analysis. ))
[DavidS on wins for computer science as a field]
DavidS?? - using the 1M book corpus to generate a new ?? seems ideal. Courses that deal with the web are often the most boring parts of the department. [But this is fascinating.] If we thought hard about building stats models and deep processing [where new work can be done], that would be a real win for CS in general. [a professor's take!]
[Maura?? on the variety of audiences involved]
Maura?? - coming after David and listening today: the audience interests are really different, depending on who you are and where you stand. I had teachers?? write in and tell me, regarding the American Experience: you should do a section on these important topics, addresses. This is an avenue that needs to be explored: define the user experience, the package of information they are getting at the most general level. A lot of people use WP, but when talking about reaching millions of people you need to find out more about their needs. What can you do for each audience? Can you suck them in, get information from them?
DavidS - when the teacher writes to you saying 'make us a 10-min video', lots of people could do this faster than you could. It's worth sharing that task.
(( Aside about how Steve is actually doing that this summer? I missed this --Ed ))
Greg - WP exists because there are more than flyby users; there are people who dig in. One [core] user group is the people who want to sit down and work. Do you focus on the smaller number of people who will produce a lot of value? Also on the flyby readers?
[ Agnes on the conversation from a historian's perspective ]
Agnes - it was interesting to learn... at first I thought we were thinking about structuring the 1M books to make the content more useful; in the end I understood maybe it is not a general structure, but learning methods. This is what David told us: there is education [needed] to handle new methods. But the work we do as historians does not change... there is a huge [amount] of media and innovation, and we have to keep in mind what we are looking for - text mining and so on. It would be useful if archaeologists of a future generation can [do this]: we can learn now, but future students could know these [methods] from the first semester.
Greg - there is more resonance to educating people today to use the tools of the present and future...
((I think we did in fact decide to focus on making the material useful and away from developing methods for their own sake --ed. ))
[Brian on focusing on readers and demand for services]
Brian - concentrate on the benefit to the user. That means readers: not content researchers, not students, not scholars or educators, but broad public users. Take as my user base people who use Wikipedia. [readers] They are already there to find something; we don't need to pull them in. Build a discovery tool for IA that lets you discover things by topic. Use David's software perhaps: a simple API that shows you what is available, feeds back topics with weighting, changes the scoring a bit, and returns a ranked list.
Make this an extension of external links in WP. But build this as a portal similar to the Firefox Googlepedia plugin: Wikipedia on one side, this list of 10 best hits on the other. Concentrate on the API, not the UI, but the UI would give a ranked list of the 10 best suggestions for further reading.
Hand-curate the first cut. Feed this in as a first cut and iterate. Let many people do interesting data mining on Wikipedia itself. That would be doable within a year. DavidM's part would perhaps be hard.
Then you don't have to build for each use. you could move from the Archive to other repositories.
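(( A minimal sketch of the discovery service Brian describes above, in Python: score candidate books by weighted overlap with topics extracted from a Wikipedia article, and return a ranked list of the best hits. All names and data shapes here are hypothetical illustrations, not an actual IA or Perseus API. --Ed ))

```python
def rank_books(article_topics, books, n=10):
    """Rank candidate books against a Wikipedia article's topic weights.

    article_topics: {topic: weight} extracted from the article.
    books: list of (title, {topic: relevance}) pairs from a repository index.
    Returns the n highest-scoring (title, score) pairs, best first.
    """
    scored = []
    for title, topics in books:
        # Weighted overlap: sum article weight * book relevance per shared topic.
        score = sum(w * topics.get(t, 0.0) for t, w in article_topics.items())
        if score > 0:
            scored.append((title, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Example: an article mostly about Rome, partly about war.
article = {"rome": 0.7, "war": 0.3}
candidates = [
    ("History of Rome", {"rome": 1.0}),
    ("On War", {"war": 1.0}),
    ("Shipbuilding", {"ships": 1.0}),
]
suggestions = rank_books(article, candidates)
```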
((Adding 'whatlinkshere' and discussion pages to books will make a big difference.))
[Mike on stimulating and encouraging developers, and quick faceted search]
Mike - Stimulate the developer community! We're making that mistake... it is basic marketing 101. Allen Hur??, who was at SI 2.0 doing platform development at Myspace, is the smartest guy I've talked to about stimulating dev communities. He gave a great talk at O'Reilly Graphing Social in 2008. They do developer jams: gather developers together, write code. Teach people what is possible and learn what is [interesting] with this corpus of data. Inventory and promote it.
[Get to] faceted search right off the bat. Endeca did faceted search; we were astounded at how well this worked for structured and unstructured data. They could pilot 10,000 books chosen at random in an afternoon.
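(( The core of a faceted-search pilot like the one Mike mentions is just counting facet values over a record set; a toy Python sketch, with hypothetical field names: --Ed ))

```python
from collections import Counter

def facet_counts(records, facets):
    """For each facet field, count how many records carry each value.

    records: list of dicts, e.g. {"author": ..., "decade": ..., "subject": [...]}.
    facets: field names to facet on; list-valued fields count each value once.
    Returns {facet: Counter({value: count})}.
    """
    counts = {f: Counter() for f in facets}
    for rec in records:
        for f in facets:
            value = rec.get(f)
            if value is None:
                continue
            values = value if isinstance(value, list) else [value]
            for v in values:
                counts[f][v] += 1
    return counts


# Example over a tiny "random sample" of book records.
sample = [
    {"author": "Vergil", "subject": ["epic", "poetry"]},
    {"author": "Homer", "subject": ["epic"]},
]
facets = facet_counts(sample, ["author", "subject"])
```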
Then decide who the audience is. There's a guy who wrote "We don't make widgets", a business book. Decide whether the audience is developers, Wikipedians, end users, or lifelong learners. Choose one consciously and [satisfy] that niche to the greatest degree you can. That will unlock [other uses]...
(( http://www.wedontmakewidgets.com/ ))
Maura - Even if your niche is so wide as to be "Wikipedia users".
Josh - That's not that wide compared to /all/ users.
Mike - When someone says 'we want to see more use of this', ask: by whom?
Steve - WP divides its audience into many groups : core users, regular readers, [developers,] etc.
[Josh on services to provide]
Josh - If I were going to write a proposal today, what would I do? Pull together a number of strands. Brian put it well: the books are not primarily a resource to be read, but a data resource to be mined, and that can produce valuable services as intermediaries and ways to knowledge.
I would frame a project, saying : we will put the books in a black box with processes and focus on inputs and outputs -- and apply them to multiple use cases. It's a big enough and important enough topic that, well, 3 is a nice number:
Say it will build out a set of services on top of this corpus, fundamentally drawing off of topic clustering we saw; the 3 uses would be:
- an informal learner who wants to know what they are learning
- a library user discovering books
- a WP contributor who wants to find more concrete material on the topic of their article.
Each of these is a distinct use case; they have come up in different threads -- different use cases and communities, and you could find strong partners in each. Imagine development arcs in which a group of people representing different perspectives come together to define the shape of the UI. Then a first pass: people trying to build, in parallel, different uses of a data service.
Finally an interaction with testing and broadening the spec to open it up to others. Do this with the whole corpus, a slice, etc. A concrete definable thing that answers the fundamental argument of how to use this stuff, what the value is [to that group]. A couple different baskets would be strategically useful.
[Mathias on the speed of technological change]
MS - I don't want to repeat what has already been proposed; there are a lot of good suggestions, including yours, Josh. There is a picture in my head I cannot get rid of: a student from India who said, "I like Wikipedia so much!" When he was asked what he loves best, he said, "There are these great pickup lines in WP that I can use..."
((I wonder what he meant! --ed.)) His obvious idea would be some connection between Wikipedia and Shakespeare's sonnets, which have been working for centuries...
Any proposal I can think of runs the risk of sounding too narrow because of the technical limitations changing so rapidly. Flash sizes are changing rapidly now. 2TB flash cards are becoming possible.... so people may not need an API, but rather services that run on a portable device on the corpus directly.
I am unable to propose any simple idea that will not be superseded next week by what's possible, including the tech to run a supercomputer under your desk [if you need computational power]. In the end it will be just providing some infrastructure to build up applications for a large audience. And defining the area in which this development should happen.
[Francesco on updating the notion of 'book' from one author's snapshot of thousands of works to a living mural of those thousands]
Francesco - I may be the wrong person to speak but what really got my imagination was from the DEMOS book that gets us straight from the paper [analysis] to the sources... a professor who was visiting Perseus last month said what we call a book is an immense cultural artefact composed of [bits from] millions of books.
If we can move back from this simple?? work that we call a classical book to [engaged] exegesis and tradition of discussion, to get around these compromises [imposed by traditional publishing] -- that is what I want.
(( great idea -- implement two-way links as WP does, ways to capture refs and usage and comments side by side with pages of a work... --Ed))
Greg - I remember Richard Lannon?? said that Americans don't care about anything, that they know only superficial things. In fact people know things that may not count from an academic's perspective... but they know [them in depth]. They know about every Ford, or every gun in the national history museum. People will come with a very precise knowledge of that history and metallurgy, and you'll hear about it [if you ask]. That depth is a broad, beyond-[traditional-]scholarship body of knowledge.
[Francesca?? on a public wiki discussion of classics research, and preservation of Italian books]
Francesca?? - I want to see a new classics encyclopedia... I am from Italy and know that elsewhere there is a prejudice against this sort of [public] discussion and open analysis. We have to try to build a wiki academia, I think. And as an Italian I would of course like to see more Italian books in Google Books. We are losing our patrimony in Italy, because our libraries are often in bad condition. So it is, or will become, a problem for the future.
[Alison on collective classification, a librarian's view]
Alison - there are things from my library background I find particularly interesting: getting and vetting contributions. I like Dan Cohen's idea of getting and moving data, having archived scans made for us. WorldCat is trying item search based on subject headings; think of the power of having related items based on so much more than 3 headings [perhaps assigned by someone who didn't have a lot of knowledge of the book]... The opportunity of having that many hours of labor, having people visit Wikipedia and then see this million books: the ability to link all of these silos of data [and build better categories with a lot of contributing eyes]. This is an important idea.
[Marielle?? on sharing and crossing the language barriers involved]
Marielle?? - the education of historians is another important point. I'm not technical, but the idea of WP and the 1m book library is based on sharing knowledge and transparency of practices within the working team. The sharing of each participant's work is very important. It is new for history! As a historian I say also: building on this archive requires collaboration, since history is about knowledge.
As a digital humanist I am pleased to hear what I have heard, something that isn't always clear in Europe, that history is about language too, and when building a system... we have to [plan for and] deal a lot with linguistic ambiguities that historians sometimes forget.
[SJ on wrapping this into a public place to continue conversation]
SJ - A final thought : we need a place to open the conversation, to share ideas about how to move forward. A place to publicize ideas for services wanted, and groups interested in making existing resources more useful to others (both things I specifically heard today).
<Greg asks for other comments>
Greg - Thanks to everyone for coming; we seem to have covered everything with time to spare.
<we adjourned a bit early>
[ Service ideas noted during the day --SJ ]
- A WP service that makes available services prominent -- easy-to-find preferences from the main preferences menu.
- A service - let people add original works by OCR to wikisource, and edit there. OL can eventually take this over and own it if they want, that would be swell. But wikisource can handle this in all languages at once tomorrow. A script is needed that would process a request to import a source work, find a canonical text for it (perhaps out of hundreds digitized; find the right FRBR level) chunk it into pages, and scriptably create a multi-page Wikisource entry.
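(( The chunking step of the import script above could be sketched as follows; the "Work/Page N" subpage naming is my own illustrative assumption, not necessarily Wikisource's actual convention. --Ed ))

```python
def chunk_into_pages(text, lines_per_page=30):
    """Split an OCR'd plain text into page-sized chunks for a multi-page entry."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + lines_per_page])
            for i in range(0, len(lines), lines_per_page)]

def page_titles(work_title, n_pages):
    """Generate hypothetical 'Work/Page N' subpage titles for each chunk."""
    return [f"{work_title}/Page {i + 1}" for i in range(n_pages)]


# Example: a 65-line OCR text becomes three pages of 30, 30, and 5 lines.
ocr_text = "\n".join(f"line {i}" for i in range(65))
pages = chunk_into_pages(ocr_text)
titles = page_titles("Aeneid", len(pages))
```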
- A service - generating / storing UIDs for references/cites/individual stats. When these merge or change names, maintaining the namespace with redirections. Similar to how WP provides an authority file for article names... except WP doesn't guarantee them forever, they can be deleted and overwritten with different entries. So a slightly different service that provides permanent UIDs for bits of knowledge to be referenced. Third party archives would be responsible for maintaining maps between these UIDs and their own custom ID systems.
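(( The merge-with-redirection behavior of such a UID service can be sketched as a small registry: merged UIDs are never deleted, they resolve forever to the surviving UID by following a redirect chain. A toy Python sketch, all names hypothetical: --Ed ))

```python
class UIDRegistry:
    """Permanent UIDs with redirects: merged IDs always resolve to the survivor."""

    def __init__(self):
        self._redirects = {}   # old_uid -> newer uid (possibly chained)
        self._live = set()

    def mint(self, uid):
        """Register a new live UID."""
        self._live.add(uid)

    def merge(self, old, new):
        """Record that `old` now redirects to `new`; `old` is never reused."""
        self._redirects[old] = new
        self._live.discard(old)

    def resolve(self, uid):
        """Follow redirects to the current canonical UID."""
        seen = set()
        while uid in self._redirects:
            if uid in seen:
                raise ValueError("redirect cycle")
            seen.add(uid)
            uid = self._redirects[uid]
        return uid
```

A third-party archive would then only need a map from these UIDs to its own internal IDs, refreshed via resolve().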
- A service - REQUEST FOR ANALYSIS - "I want to know the # of pages per decade referencing Vergil's Aeneid." The result could be a public response to a public request, with a UID link taking you to the result. "I want the # of references to deathbed scenes in the Perseus corpus over the 20th century." Generate public URLs that others can reference, and discuss this in other fora in various languages.
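(( The pages-per-decade example above reduces to a simple aggregation once the corpus has been searched for hits; a Python sketch over hypothetical (year, page) hit data: --Ed ))

```python
from collections import Counter

def pages_per_decade(hits):
    """Count reference hits per decade.

    hits: iterable of (year, page_id) pairs where the work is cited.
    Returns {decade_start_year: number_of_hits}.
    """
    counts = Counter()
    for year, _page in hits:
        counts[(year // 10) * 10] += 1
    return dict(counts)


# Example: hits for "Vergil's Aeneid" found by an upstream corpus search.
hits = [(1905, "p1"), (1907, "p2"), (1912, "p3")]
result = pages_per_decade(hits)
```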
- A meta-service - DISCOVERY OF SERVICES AND CONVERSATIONS - find out what others are using book data and services for. Find what they are using WP data and services for. Improve the visibility of existing projects, and provide a way to contact people doing work a reader/researcher is interested in. Provide an aggregate channel for all conversations around these services (the conversations can start elsewhere, but there should be a way to find pointers to many of them in various languages and communities).
- Wiki academy : include a wiki academy day for academics and students in the New England Wikipedia/free-culture conference, perhaps on Friday July 24.
- Wikipedia Project idea : a 'Wikidata' wiki for sets of raw data. For data rather than for analysis -- consider language-trees.
- A media sharing service : share images and media from the NYPL and Smithsonian. Provide better APIs for same, find out how Wikipedians would like to search, access, and upload these. Identify new sources of freely licensed media (including for instance the originals up at http://wdl.org ) and help Wikipedians and librarians push them out to a wider audience.
- Directly address the need for new notability metrics based on raw publication data for books, number of references to scientific works. Engage the research and book wikiproject editors on major language Wikipedias.
- Finally, many people today used the word "disambiguation" in a meaningful context. I love it -- I believe this is a neologism spawned by Wikipedia -- can anyone recall using it before 2004?? :-)