Text Mining #2: How scholars can support digital libraries

16 November 2015

Guest Blog Post by Ted Underwood, University of Illinois, Urbana-Champaign, tunder@illinois.edu

Libraries, of course, support scholarship. But in the twenty-first century, scholars might occasionally be able to return the favor. As scholars develop experience mining large datasets, there is a growing area of overlap between the things libraries need done (organizing collections, enriching metadata), and the tasks that count as research in other fields. This area of overlap holds opportunities for both sides.

It also holds risk. As Michael Pidd has pointed out, the scholarly landscape is already littered with ad hoc datasets designed to support a particular inquiry, without much planning for re-use. The implicit warning seems clear: the task of organizing cultural heritage will not be solved by research groups working in isolation. Libraries remain central - as sources of expertise, sites of collaboration, and of course as institutional guarantors of permanence.

But there may still be a role for scholars to play. Once large digital libraries have been organized, and standards have been established, individual researchers can usefully supplement a collection with provisional metadata schemas that point to standard identifiers, and thus (with any luck) leave a trail of bread crumbs for others to follow.

I've attempted something like this recently in collaboration with HathiTrust Research Center. HathiTrust receives metadata about books from its member institutions. Although it has done a great deal to harmonize different sources, it still inevitably shares some of their blind spots: for instance, records for nineteenth-century volumes usually lack information about genre. For literary scholars, this is an important gap. We're interested in genres (like poetry or drama) that may be widely scattered both within and across volumes.

To find as many of those texts as I could, I trained predictive models of genre and swept them through 850,000 volumes in English, producing genre metadata that points back to specific pages in HathiTrust. This metadata can be freely shared, and to some extent so can data about the texts themselves. Licensing arrangements often make it hard to share full text, but scholars pursuing research on this scale don't always need sequential full text: a public dataset of word counts associated with pages in a particular genre has already been useful, both in my own research, and in projects by other scholars. Using the same data, Chris Forster has investigated questions about gender and Jonathan Goodwin has produced topic models of fiction.
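The modeling step described above can be sketched in miniature. This is not the author's actual pipeline (which used regularized logistic regression over hundreds of thousands of pages); it is a toy multinomial Naive Bayes classifier, with invented genre labels and word counts, showing the basic idea of predicting a page's genre from its word counts alone:

```python
# A minimal sketch, not the project's real classifier: predict a page's
# genre (poetry vs. prose) from its word counts with Naive Bayes.
# All labels, words, and counts below are invented for illustration.
import math
from collections import Counter, defaultdict

def train(pages):
    """pages: list of (genre, Counter of word counts for one page)."""
    class_counts = Counter()
    word_totals = defaultdict(Counter)
    vocab = set()
    for genre, counts in pages:
        class_counts[genre] += 1
        word_totals[genre].update(counts)
        vocab.update(counts)
    return class_counts, word_totals, vocab

def predict(model, counts):
    """Return the genre with the highest log-probability for this page."""
    class_counts, word_totals, vocab = model
    total_pages = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for genre in class_counts:
        lp = math.log(class_counts[genre] / total_pages)
        genre_total = sum(word_totals[genre].values())
        for word, n in counts.items():
            # Laplace smoothing over the shared vocabulary
            p = (word_totals[genre][word] + 1) / (genre_total + len(vocab))
            lp += n * math.log(p)
        if lp > best_lp:
            best, best_lp = genre, lp
    return best

training = [
    ("poetry", Counter({"thee": 3, "o": 2, "heart": 2})),
    ("poetry", Counter({"thy": 2, "soul": 2, "o": 1})),
    ("prose",  Counter({"said": 3, "the": 5, "went": 2})),
    ("prose",  Counter({"chapter": 1, "the": 4, "said": 2})),
]
model = train(training)
print(predict(model, Counter({"thee": 2, "soul": 1})))  # prints "poetry"
```

Note that this sketch consumes exactly the kind of data the paragraph describes: per-page word counts, which can be shared publicly even when sequential full text cannot.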

This project encountered many challenges, and we haven't solved them all. Algorithmic prediction is imperfect, so it's important to communicate levels of error. Digital libraries also change and grow; in some cases, my page-level predictions are now out of date, because new scans of a volume have altered the pagination. (In the next iteration of the project, we hope to address that problem with persistent page-level identifiers.)
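Communicating levels of error usually means publishing precision and recall against a hand-checked sample alongside the predictions. A toy illustration (the labels and numbers here are invented, not figures from the project):

```python
# A toy sketch of reporting error levels for genre predictions:
# precision and recall for one genre against hand-checked gold labels.
# The predicted/gold lists below are invented for illustration.
def precision_recall(predicted, gold, genre):
    tp = sum(1 for p, g in zip(predicted, gold) if p == genre and g == genre)
    fp = sum(1 for p, g in zip(predicted, gold) if p == genre and g != genre)
    fn = sum(1 for p, g in zip(predicted, gold) if p != genre and g == genre)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

predicted = ["poetry", "poetry", "prose", "poetry", "prose"]
gold      = ["poetry", "prose",  "prose", "poetry", "poetry"]
p, r = precision_recall(predicted, gold, "poetry")
print(f"poetry: precision={p:.2f}, recall={r:.2f}")
```

Publishing numbers like these lets downstream users decide for themselves whether the metadata is accurate enough for their purposes, or whether they should filter to higher-confidence predictions.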

But the sine qua non for the project to happen at all was HathiTrust's willingness to make their collection as open as legally possible. This required some imagination, because digital libraries inherit a model of scholarship organized by "search." The scholarly task is envisioned as recovery of a single document, or of a small collection; a library's task is to make those documents discoverable and serve them up one by one.

Scholars working on a macroscopic scale need a basically different model of access. Portals, platforms, and interfaces are largely beside the point for us. APIs are good, and I'm grateful that HathiTrust has APIs for both text and metadata. But even an API is ultimately a straw. You can't suck 850,000 volumes through a straw without inconveniencing someone.

I hesitate to admit it, but what data-mining scholars really want is to have as much of the library as physically and legally possible on their own machine (or a local cluster). I know this sounds strange. It reverses the ordinary relationship between library and patron. But for scholars who do data mining, bulk downloads are essential. (For instance, if I had asked the library to separate out a collection of "fiction," I would have created a lot of extra work for librarians - and would also have overlooked a lot of volumes. The boundary between fiction and nonfiction was something I needed to map myself; and to do it, I needed to start with a collection much larger than "fiction.")

How much data do text miners need? "As much as possible" is a flexible metric. But, for instance, if I can have a single table summarizing the metadata for every volume in a library, that's definitely something I want. If I can have the MARC records, I want those too. If I could have the full text of all volumes in English, I would want that as well -- but alas, copyright law prevents sharing full text beyond 1922 in the US.
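The value of "a single table summarizing the metadata" is that filtering happens on the researcher's machine rather than through a search interface. A hedged sketch, with invented column names and records (real HathiTrust metadata uses different identifiers and many more fields):

```python
# A sketch of filtering a bulk metadata table locally by language and
# date. The column names, volume IDs, and records are all invented.
import csv
import io

table = """volume_id,title,language,year
mdp.001,Poems on Several Occasions,eng,1840
mdp.002,Histoire de France,fre,1855
mdp.003,A Tale of Two Cities,eng,1859
"""

reader = csv.DictReader(io.StringIO(table))
english_c19 = [row for row in reader
               if row["language"] == "eng" and 1800 <= int(row["year"]) < 1900]
for row in english_c19:
    print(row["volume_id"], row["title"])
```

In practice the table would be a file of several hundred megabytes read from disk rather than an inline string, but the workflow is the same: one pass over the whole collection, with selection criteria the researcher controls.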

HathiTrust Research Center is developing a secure virtual machine (the Data Capsule) that will make it possible for scholars to pursue research on texts indirectly, without copying the text itself. For many research problems beyond the wall of copyright, the flexibility of a virtual machine will be the only viable solution. But even here it may often be possible to legally export summary representations (word counts, for instance), and whenever that is possible, I think researchers will prefer to download data in bulk and manipulate it on their own machines.

The use cases I'm describing are admittedly a minority. There are still many more people browsing records one by one, through some kind of search interface. But although we represent a small slice of the user community, I've tried to suggest that data-mining scholars can be a useful slice. Our goals align closely with the core mission of a library: we want to make the collection as a whole more transparent and intelligible for others. To accomplish that, we need some way to download whole collections (for instance, through rsync, which makes updates easy to manage). For this group of users, the details of data formatting are not critical (XML or plain text are both fine). Search interfaces are not very important. What we need is timely and comprehensive bulk access to metadata (and as much data as legally possible).