UC DLF: Participatory Digital Collections15 Feb 2018
Below are the slides and speaking notes for our Collections Lab presentation on “Participatory Digital Collections: Opportunities for Enhancement, Experimentation, and Transformation,” presented at the 2018 UC DLF at UC Riverside.
Participatory digital collections and the Collections Lab
When we talk about “participatory collections,” we are often discussing the projects that facilitate user engagement through crowdsourcing of new data, such as transcriptions, tagging, and description. For this talk, we expand on this understanding of “participatory” to include all forms of engagement that encourage the creation and/or reuse of collections data for teaching and research, as well as efforts to contextualize or reinterpret collections through new interfaces and access points.
The UCLA Library Collections Lab is a space to promote more participatory collections and encourage the library and its users to consider our collections as data. Here, we experiment to enhance our collections and build out our “collections as data” program through both crowdsourcing, computational, and design approaches. We don’t currently have the open digital library system of our dreams (although we’re working on that) - until we do, the Collections Lab is a space for:
- Highlighting collections where there are no (or fewer) restrictions on reuse – we’re making collections data available via GitHub, provide a form for users to request collections data as needed, and provide scripts to download data from a few other digital collection repositories
- Computational and machine-learning experiments to enhance metadata and process collections so that they are more amenable to computational research methods (Pete and I will discuss a few of these shortly)
- Researcher and classroom projects, such as the “Los Angeles: The City and the Library”, where students digitize and give context to Los Angeles-related materials from UCLA’s special collections
- Crowdsourcing projects, like the William Sachtleben travel diaries transcription project
Collections Lab & BuildUCLA
Machine-learning experiments: Book Annotation Classification Engine
One of our BuildUCLA / Collections Lab experiments is the Book Annotation Classification Engine, a
With so many institutions coalescing around IIIF, we wanted to start thinking of other ways libraries and users could “act” on our digital collections in this sphere. The ability of IIIF to pull disparate collections together using IIIF manifests and viewers really opens up the possibilities for users to engage across multiple collections and institutions.
One of our current projects with UCLA’s Clark Library is the Annotated Books project, an NEH funded project to digitize, describe, and publish annotated books from the Clark collections. In addition to the book catalog data, the descriptions also include data about the annotations themselves. This got us thinking – how great would it be if we could get a machine to find all the annotations for the researchers and maybe even classify them? — and could we do this within the IIIF ecosystem?
Pete and I had just started working together on BuildUCLA, so a joint BuildUCLA / Collections Lab venture was inevitable. We recruited some talented computer science undergrads and a few colleagues to start experimenting. The resulting product will (hopefully) allow users to apply the trained models to any IIIF collections that have manifests enabled. These collections then can become platforms for machine-learning based discovery and extraction of latent collections data.
To make this work, we need loads of training data. I’m new to machine-learning, so did not initially realize how much training data was needed to get us started. For the training data, we started by making use of existing digital collections to create datasets.
The students have been busy with the collection, organization, and cleaning of collections data to prep for the work ahead. They are also experimenting with various approaches and are seeing some small successes while they learn more about machine-learning and computer vision approaches.
We are also employing crowdsourcing to build training data sets using the Zooniverse platform. We plan to reach out to communities that might be interested in book annotation and we invite anyone to contribute by tagging annotations on uploaded page images to help generate additional training data. Of course, the data will be made available for anyone to use.
Discussion: Now that we’ve talked about some of our projects and how we see these contributing to more participatory / dynamic digital collections, we’d like to move into some discussion on how we might set ourselves up for success. This often begins with selection: what criteria are we considering when we decide to digitize? Many of our digital collections have been curated/built on principles of “special-ness” - those high profile, unique, or beautiful items – or to increase access to hidden collections or provide surrogates for materials that get lots of traffic or are in need of preservation. These are great reasons to digitize; however, we’d like to discuss ways that we might extend these criteria to foster more participatory collections – What new criteria might we consider when selecting items or collections for digitization? And what criteria might we use when considering whether to adopt, build, enhance born-digital collections or those that were previously digitized? Here are a few points to get the conversation started:
Digital Collection Criteria to promote more participatory collections:
- Does the collection, if digitized, lend itself to computational approaches?
- Is there potential for generating new data if digitized?
- What are the transformation possibilities if digitized?
- Does the material have appeal beyond a small scholarly community or to a community that would take advantage of crowdsourcing/enhancement opportunities?
- Is the collection sufficiently large that digitization will allow for computational methods that will open up new lines of research not possible by individual scholar/analog approaches?
- Would a community benefit from having the collection digitized and open so they could reimagine / re-present the collection in a new way?
We might also discuss what it takes to encourage the types of user participatory engagement/behaviors that the panelists touched on in their presentations. For example:
- well-defined crowdsourcing and user-contribution features – the infrastructure should encourage user contributions but also prevent the collection from turning into grandma’s attic
- computational access should be provided (ideally) via bulk access APIs/downloading features, or at least a means of addressing individual items online in formats that are most amenable to computational processing (e.g., full texts rather than PDFs, cropped images rather than full-platen scans); if these features can’t be provided for copyright, privacy or technical reasons, then sophisticated server-side searching, annotation, visualization features are preferred