Data Management

A brief overview of machine learning practices for digital collections

Northeastern University Library’s procedure for digitizing physical materials uses a few different workflows for processing print documents, photographs, and analog audio and video recordings. Each step in the digitization workflow, from collection review to scanning to metadata description, is performed with thorough attention to detail, and it can take years to completely process a collection. For example, processing the approximately 1.6 million photographs in The Boston Globe Library collection held by the Northeastern University Archives and Special Collections may take several decades!

What if some of these steps could be improved by using artificial intelligence technologies to complete portions of the work, freeing staff to focus more effort on the workflow elements that require human attention? Read on for a very brief overview of artificial intelligence and three potential options for processing The Boston Globe Library collection and other digital collections held by the Library.

A three-part cycle, with "Input" leading to "Model Learns and Predicts" leading to "Response" leading back to "Input"

What is artificial intelligence and machine learning?
Artificial intelligence (AI) is a broad term for many different technologies that attempt to emulate human reasoning in some way. Machine learning (ML) is a subset of AI in which a program is taught how to learn and reason on its own. The program learns by using an algorithm to process existing data and find patterns. Each prediction the model makes is evaluated and scored for accuracy, and the model is adjusted until its predictions reach an acceptable level of accuracy.

ML may be supervised or unsupervised, depending on the type of result needed. In supervised learning, labeled examples are provided to help the algorithm learn to identify patterns the researcher expects. In unsupervised learning, the algorithm is fed unlabeled data and discovers its own patterns, which may be unknown to the researcher.
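
As a concrete illustration, here is a minimal sketch of both approaches using scikit-learn and its bundled iris dataset. The dataset, models, and parameters are illustrative stand-ins, not tools from our actual workflows:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: labeled examples teach the model the patterns we expect,
# and predictions are scored against held-out labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("supervised accuracy:", clf.score(X_test, y_test))

# Unsupervised: no labels -- the algorithm discovers its own groupings.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("discovered cluster sizes:", [int((clusters == c).sum()) for c in range(3)])
```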

Ethics
As we undertake this work, it is important to be aware that AI technologies are human-made and therefore human biases are embedded directly within the technology itself. Because AI technologies can be employed at such a large scale, the potential for negative impact caused by these biases is greater than with tools that require standard human effort. Although it is tempting to adopt and employ a useful technology as quickly as possible, this is an area of research where it is imperative that we make sure the work aligns with our institutional ethics and privacy practices before it is implemented.

What AI or ML techniques could be used to help process digital collections?
OCR: The most widely known and used form of AI in digital collections practices may be recognition of printed text using Optical Character Recognition, or OCR. OCR analyzes images of printed text and extracts the text itself: letters, words, and sentences. The results may be embedded directly in the file (like a PDF with OCR’d text), stored separately (like in a METS-ALTO file), or both.
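
For a sense of what this looks like in code, here is a minimal sketch using the open-source Tesseract engine via the pytesseract wrapper. The input file name is hypothetical, and this is not necessarily the engine used in the Library’s workflow:

```python
from PIL import Image
import pytesseract

# Extract plain text from a scanned page (hypothetical file name).
image = Image.open("scanned_page.png")
text = pytesseract.image_to_string(image)
print(text[:200])

# Tesseract can also emit ALTO XML, the layout-aware format mentioned above.
alto_xml = pytesseract.image_to_alto_xml(image)
print(len(alto_xml), "bytes of ALTO XML")
```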

Screenshot of the front page of the Winchester News
Image source: Screenshot of an OCR page of The Winchester News with METS-ALTO encoding opened in AltoViewer.

OCR works rather well for modern text documents, especially those in English, but historical documents pose a particular challenge. For more about this challenge, I recommend A Research Agenda for Historical and Multilingual OCR, a fairly recent report published by NULab.

A screenshot of a search result that reveals the result was returned because the search term matched OCR'd text within the document.

We can already see the benefit of using OCR in the library’s Digital Repository Service: files with embedded OCR text have their full text extracted and stored alongside the file. That text is indexed, which improves the discoverability of text files, since a search can retrieve files whose terms match either the file’s metadata or its full text.
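
As a toy illustration of why this helps (not the DRS’s actual implementation), the sketch below extracts a PDF’s embedded text layer with pdfminer.six and builds a tiny inverted index that a search could consult; the file name and search term are hypothetical:

```python
from collections import defaultdict
from pdfminer.high_level import extract_text

index = defaultdict(set)  # word -> set of files containing it

def add_to_index(pdf_path: str) -> None:
    text = extract_text(pdf_path)      # pulls the embedded OCR text layer
    for word in text.lower().split():
        index[word].add(pdf_path)

add_to_index("ocr_document.pdf")       # hypothetical repository file
print(index.get("boston", set()))      # files whose full text matches
```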


The back of a photograph from the Boston Globe Library Collection, featuring difficult-to-read handwritten descriptions.
Digitized back of a photograph from The Boston Globe Library collection.

HTR: Handwritten Text Recognition, or HTR, is like OCR, but for handwritten rather than typewritten text. Handwriting is highly individual and poses a difficult challenge for teaching machines to interpret it. HTR relies heavily on having lots of data to train a model (in this case, lots of digitized images of handwriting), so even once a model is accurately trained on one set of handwriting, it may not be useful for accurately interpreting another. Transkribus is a project attempting to navigate this challenge by creating training sets for batches of handwriting data: researchers submit at least 100 transcribed images for a particular handwriting set, and Transkribus uses that set as training data to create an HTR model to process the remaining corpus of handwritten text. HTR is appealing for the Boston Globe collection, as the backs of the photographs contain handwritten text describing the image, including the photographer’s name, the date the photograph was taken, classification information, and perhaps a description or an address.
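
Transkribus is one platform for this work; as a code-level illustration of HTR in general, here is a minimal sketch using the open-source TrOCR handwriting model from Hugging Face. The input image is hypothetical, and the model is an assumed stand-in rather than anything we have adopted:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load a model pretrained on handwritten English text.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# Hypothetical scan of the back of a photograph.
image = Image.open("photo_back.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate a transcription of the handwriting.
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```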

Computer Vision: Computer vision refers to AI technologies that allow machines to work with images and video, essentially training a machine to “see”. This type of AI is particularly challenging because it requires the machine to learn how to observe a picture, analyze it, and understand its content. Computer vision algorithms are trained to identify patterns corresponding to different objects or people and to sort and identify those patterns accurately. In a picture of the Northeastern campus, for example, a computer vision algorithm may be able to identify buildings, people, or trees as distinct objects.

A black and white photograph of a man being arrested by two police officers next to an analysis of the photo's contents: Footwear (98%); Shoe (96%); Gesture (85%); Style (84%); Military Person (84%); Black-and-white (84%); Military Uniform (80%); Cap (80%); Hat (78%); Street Fashion (75%); Overcoat (75%)
Result of Google Cloud’s Vision API analysis for a black and white photograph.

When used in digital collections workflows, the output produced by computer vision tools will need to be evaluated for its usefulness and accuracy. In the above example, the terms returned to describe the image are technically present in the photo (the subjects are wearing shoes and hats and overcoats), but the terms do not adequately capture the spirit of the image (a person being detained at a demonstration).
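
For reference, here is a minimal sketch of the kind of call that produced the labels above, using Google Cloud’s Vision API Python client (credentials setup omitted; the file name is hypothetical):

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Hypothetical digitized photograph from the collection.
with open("photo.jpg", "rb") as f:
    image = vision.Image(content=f.read())

# Request label detection; each label carries a confidence score,
# which is exactly what needs human evaluation for usefulness.
response = client.label_detection(image=image)
for label in response.label_annotations:
    print(f"{label.description} ({label.score:.0%})")
```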

There are many ethical concerns about using computer vision, especially for recognizing faces and assigning emotions. If we were to employ this particular technology, it might be able to generate keywords or other descriptive metadata for the Boston Globe collection that are not present on the back of an image, but we would need to be careful to ensure that the process does not embed problematic assessments into the description, like describing an image of a protest as a riot.

Computer vision is already being employed in some digital collection workflows. Carnegie Mellon University Libraries has developed an internal tool called CAMPI to help archivists enhance metadata. An archivist uses the software to tag selected images, and the program then returns other images it identifies as visually similar, regardless of box and folder, allowing the archivist to apply the same tags to those images without having to seek them out manually.
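
CAMPI’s internals aren’t described here, but the underlying idea of visual similarity can be sketched generically: embed each image with a pretrained network and compare the embeddings. The model choice and file names below are assumptions for illustration, not the actual CAMPI pipeline:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained ResNet-50 with the classifier removed, leaving 2048-d features.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Return a unit-length feature vector for one image."""
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return torch.nn.functional.normalize(model(x), dim=1)

# Cosine similarity between a tagged image and a candidate (hypothetical files);
# values near 1.0 suggest the candidate should receive the same tags.
score = (embed("tagged.jpg") @ embed("candidate.jpg").T).item()
print(f"similarity: {score:.3f}")
```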

Many other aspects of AI and ML technologies will need to be researched and evaluated before they can be integrated into our digital collections workflows. We will need to assess tools and identify the training staff will need to perform the work. We will also continue to watch leaders in this space as they dive deep into the world of artificial intelligence for library work.

Recommended resources:
Machine Learning + Libraries: A Report on the State of the Field / Ryan Cordell: https://blogs.loc.gov/thesignal/2020/07/machine-learning-libraries-a-report-on-the-state-of-the-field/
Digital Libraries, Intelligent Data Analytics, and Augmented Description / University of Nebraska–Lincoln: https://digitalcommons.unl.edu/libraryscience/396/

Learn to Write a Data Management Plan, Find Out What Social Media Knows About You, and More

"You Are Here" artwork by Mario Klingemann

How does your commute make you feel? Map it! What does Facebook know about you? Download your data! What do you need to say about your data in a grant proposal? Learn about data management plans!

We’re hosting a few events this month to coincide with Love Data Week and Endangered Data Week, and you’re invited!

Check out the full lineup and register for your spot: bit.ly/snelldata19

“You Are Here” by Mario Klingemann on Flickr, CC BY 2.0

BYO data & code! Preparing for reproducible publication workshop

Data code workshop flyer NU

We’re bringing Code Ocean to campus on Nov. 8th for a hands-on, interactive workshop.

This 2-hour session is a unique opportunity to bring order to your own data and/or code! You’ll receive expert, step-by-step guidance on:
  • Organizing your files
  • Creating a codebook (so that others – not to mention your ‘future self’ – can understand how & why you’ve done things)
  • Preparing your code & data for documentation and reuse
  • Maximizing the potential reproducibility of your research outputs
Space is limited to about 20 attendees, so please register soon to reserve your place. More info and the registration link are here. Questions? Contact Jen Ferguson or Tom Hohenstein.

Data Fest is coming in February

Since Love Data Week and Endangered Data Week both happen in February, we thought we’d use this month to showcase some of the great data-related services and resources we have to offer here at Snell. We’re calling it Data Fest, and you’re invited! Here’s a taste of what we have planned:
  • Stop by and lend a hand at our Citizen Science: Health Hackathon
  • Make friends with your command line at our Intro to the Unix Shell workshop
  • Learn how to create impressive charts & data visualizations at our workshops on Tableau and free web-based tools
And more! Check out the full lineup and register here: http://bit.ly/snelldatafest18

Meet the 2017 CERES Exhibit Toolkit Projects!

The DSG is proud to announce the projects chosen for this year’s round of CERES Exhibit Toolkit development. We will work with the following four projects to implement enhancements and new features to improve user experience, create additional exhibit tools, and incorporate the Toolkit in the classroom:  

Boston as Middle Passage

In 2015, students and researchers working with the National Parks Service built a website to preserve research documenting Boston as one of many transatlantic slave trade Middle Passage sites. Sadly, in less than two years the site has become unusable due to server issues and lapsed hosting. This year we will work with the creators of the site to transfer the rescued research materials to the DRS and recreate the original exhibits in the Early Black Boston Digital Almanac (a 2016 Toolkit project still in development).  

Dragon Prayer Book

The Dragon Prayer Book project is a research endeavor led by Erika Boeckeler, faculty in the Department of English, to study the Dominican Prayer Book, a fifteenth-century manuscript held by Archives and Special Collections. The Dragon Prayer Book project was accepted as a Toolkit project in 2016, and this year we will work with the project team to enhance the Toolkit’s IIIF high-resolution image viewer: http://dragonprayerbook.northeastern.edu/mirador/

Freedom House

As part of their ongoing effort to highlight archival collections using online exhibits, last year Archives and Special Collections used the Toolkit to create a set of exhibits for the Freedom House photograph collection: http://freedomhouse.library.northeastern.edu/. This year, Archives proposed a new browse feature that would allow them to build dynamic exhibits bringing together all Freedom House materials that match a particular subject term, like “Kennedy, John F.”. This enhancement will allow Archives and other Toolkit site builders to create dynamic exhibits that automatically populate with DRS materials matching particular subjects, creators, or other faceted metadata values.
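
The DRS search infrastructure isn’t detailed here, but the idea behind such a dynamic exhibit can be sketched as a faceted query against a search index. The sketch below uses a hypothetical Solr-style endpoint; the URL and field names are invented placeholders, not the DRS’s actual API:

```python
import requests

# Hypothetical Solr-style search endpoint and field names -- placeholders
# for illustration only.
params = {
    "q": "*:*",
    "fq": 'subject_facet:"Kennedy, John F."',  # filter by a subject facet
    "rows": 50,
    "wt": "json",
}
response = requests.get("https://example.edu/solr/select", params=params)

# An exhibit page could render every matching item, updating automatically
# as new materials with that subject are added to the index.
for doc in response.json()["response"]["docs"]:
    print(doc.get("title"))
```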

Literature and Digital Diversity

This fall, Elizabeth Dillon and Sarah Connell will be co-teaching Literature and Digital Diversity, an undergraduate course focusing on “the use of digital methods to analyze and archive literary texts, with particular attention to issues of diversity and inclusion”. Students in the class will use the Toolkit to explore “how computers, databases, and analytical tools give substance to concepts of aesthetic, cultural, and intellectual value as inflected by race and gender.” This project will be the first to use the CERES classroom teaching materials originally developed for Nicole Aljoe’s award-winning Writing Black Boston class, which used the Toolkit to create the Early Black Boston Digital Almanac (still in development). To increase the breadth of materials available to the class (and other site builders), we will also consider adding Europeana as an additional data source for Toolkit materials (similar to the DPLA connection built in 2016).

We also continue work with our partners on the 2015 and 2016 projects. For more information about these projects, visit the DSG website (about the projects, about CERES) or contact us.