Digital Production Services

Using AI to Automate Library Captioning

Captions play a key role in making audio and video content accessible. They benefit not only deaf and hard-of-hearing users, but also second-language learners, researchers scanning interviews, and anyone viewing content in noisy environments.

At the Northeastern University Library, we manage a growing archive of media, from lectures and events to oral histories. Manually creating captions for all of this content is not a scalable solution, and outsourcing the task to third-party services can be expensive, time-consuming, and inconsistent. Motivated by this, we have been exploring AI-powered speech-to-text tools that could generate high-quality captions more efficiently and cost-effectively.

Screenshot of a video with a person speaking and a caption reading "There is an enormous need for an expansion of imagination and"
Figure 1: Example of an ideal transcription output

We started by testing Autosub, an open-source tool for generating subtitles. Even using a maintained fork (a copy of the original project that adds features, fixes bugs, or adapts the code for a different use case), Autosub did not offer significant time savings, and it was eventually dropped.

In summer 2023, the team began using OpenAI’s Whisper, which immediately cut captioning time in half. However, it lacked key features like speaker diarization (the process of segmenting a speech signal based on the identity of the speakers), and it often stumbled on long stretches of music or background noise, which required extra cleanup and made the output harder to use at scale.
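
For readers who have not used it, Whisper can be driven from a short Python script. The sketch below is illustrative rather than our production code: the model and file names are placeholders, and a real workflow would write the segments out as VTT or SRT rather than printing them.

```python
# Minimal sketch of a Whisper transcription run (illustrative, not our production script).
# Requires: pip install openai-whisper (plus ffmpeg for decoding audio/video files)
import whisper

model = whisper.load_model("large")          # model size is a placeholder choice
result = model.transcribe("lecture.mp4")     # "lecture.mp4" is a placeholder file

for segment in result["segments"]:
    # Each segment carries start/end times in seconds plus the transcribed text,
    # which is what gets converted into VTT/SRT caption cues.
    print(f"{segment['start']:7.2f} --> {segment['end']:7.2f}  {segment['text'].strip()}")
```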

As the AI for Digital Collections co-op on the Digital Production Services (DPS) team, I was responsible for researching and testing Whisper forks that could be realistically adopted by our team. I tested model performance, wrote scripts to automate captioning, debugged issues, and prepared tools for long-term use within our infrastructure.

Phase 1: Evaluating Whisper Forks

We looked for a model that could:

  • Handle speaker diarization
  • Distinguish between speech and non-speech (music, applause, etc.)
  • Output standard subtitle formats (like VTT/SRT)
  • Be scriptable and actively maintained

We tested several forks, including WhisperX, Faster Whisper, Insanely Fast Whisper, and more. Many were too fragile, missing key features, or poorly maintained. WhisperX stood out as the most well-rounded: it offered word-level timestamps, basic diarization, reasonable speed, and ongoing development support.
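
To give a sense of what that looks like in practice, here is a condensed sketch of a WhisperX run with word-level alignment and speaker diarization, adapted from the pattern shown in the WhisperX README at the time of testing. Function names and arguments vary between WhisperX versions, and the file name and Hugging Face token are placeholders.

```python
# Sketch of a WhisperX run: transcribe, align words to audio, then diarize.
# Adapted from the WhisperX README; exact APIs differ between versions.
import whisperx

device = "cuda"                                    # or "cpu"
audio = whisperx.load_audio("oral_history.wav")    # placeholder file name

# 1. Transcribe with a Whisper model.
model = whisperx.load_model("large-v3", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Align each word to the audio for precise timestamps.
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device,
                        return_char_alignments=False)

# 3. Diarize and attach speaker labels (pyannote requires a Hugging Face token;
#    newer releases expose this as whisperx.diarize.DiarizationPipeline).
diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for segment in result["segments"]:
    print(segment.get("speaker", "UNKNOWN"), segment["text"].strip())
```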

Phase 2: Performance Testing

Once we chose WhisperX, we compared its various models to OpenAI’s original Whisper models, including large-v1, v2, v3, large-v3-turbo, and turbo. We tested six videos of varying length and with different levels of background noise, and compared the models on two measures: Word Error Rate (WER), or how often the transcription differed from a “gold standard” (a human-created or human-edited transcript), and processing time, or how long each model took to generate captions.
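
For reference, WER is simple to compute once you have a model transcript and a reference transcript. The sketch below uses the jiwer Python package as one convenient option; it illustrates the metric rather than the exact tooling behind our tests, and the two transcripts are made-up examples.

```python
# Illustrative WER calculation with the jiwer package (pip install jiwer).
import jiwer

reference = "there is an enormous need for an expansion of imagination"   # gold standard (10 words)
hypothesis = "there is an enormous need for expansion of imagination"     # model output (one word dropped)

# WER = (substitutions + deletions + insertions) / number of reference words
wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")   # 1 deletion / 10 words -> 10.00%
```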

WhisperX’s large-v3 model consistently performed well, balancing speed and accuracy even on noisy or complex audio. OpenAI’s turbo and large-v3-turbo delivered strong performance but lacked diarization features.

Phase 3: Timestamp Accuracy Evaluation

Next, we assessed how precisely each model aligned subtitles to the actual audio — crucial for usability. We compared outputs from the WhisperX large-v3 model and the OpenAI turbo and large-v3-turbo models.

We used a gold standard transcript with human-reviewed subtitles as our benchmark. For each model’s output, we measured the following (a small code sketch of these calculations appears after the list):

  • Start Mean Absolute Error (MAE) — average timing difference between predicted and actual subtitle start times
  • End MAE — same as Start MAE, but for subtitle end times
  • Start % < 0.5s — percentage of subtitles with start times less than 0.5 seconds off
  • End % < 0.5s — same as Start % < 0.5s, but for end times
  • Alignment rate — overall percentage of words correctly aligned in time
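
To make these timing metrics concrete, here is a small sketch of how the MAE values and the under-0.5-second percentages can be computed, assuming each predicted cue has already been matched to its reference cue (that matching step, and the word-level alignment rate, are handled elsewhere and omitted here). The numbers in the example call are made up.

```python
# Sketch of the subtitle-timing metrics. Assumes predicted cues are already
# paired with reference cues (e.g. by matching on text). Times are in seconds.
def timing_metrics(pairs, threshold=0.5):
    """pairs: list of ((pred_start, pred_end), (ref_start, ref_end)) tuples."""
    n = len(pairs)
    start_errors = [abs(p[0] - r[0]) for p, r in pairs]
    end_errors = [abs(p[1] - r[1]) for p, r in pairs]
    return {
        "start_mae": sum(start_errors) / n,
        "end_mae": sum(end_errors) / n,
        "start_pct_within_threshold": 100 * sum(e < threshold for e in start_errors) / n,
        "end_pct_within_threshold": 100 * sum(e < threshold for e in end_errors) / n,
    }

# Example with two cue pairs: (predicted start, end) vs. (reference start, end).
print(timing_metrics([((0.10, 2.40), (0.00, 2.50)),
                      ((2.60, 5.10), (2.55, 5.90))]))
```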

WhisperX’s large-v3 model outperformed all other models significantly. In most of our test videos, it showed:

  • Much lower MAE scores for both start and end timestamps
  • Higher percentages of accurately timed subtitles (within the 0.5-second range)
  • Better overall word alignment rates

In fact, in several test cases, WhisperX was nearly three times more accurate than the best-performing OpenAI Whisper models in terms of timing precision.

Side-by-side comparison of a WhisperX transcript (labeled “HYP”) and the gold-standard transcript (labeled “REF”)
Figure 2: WhisperX output vs. gold-standard transcript in a high-WER case

In one particular case, the WER for WhisperX large-v3 was a surprisingly poor 94%. When I checked the difference log to investigate, I found that the model had transcribed background speech that was not present in the gold standard transcript. So, while it was technically penalized, WhisperX was actually picking up audio that the gold standard did not include. This highlighted both the model’s sensitivity and the limitations of relying solely on WER for evaluating accuracy.

Figure 2 shows exactly that. On the left, WhisperX (denoted “HYP”) transcribed everything it heard, while the gold standard transcript (denoted “REF”) cut off early and labeled the rest as background noise (shown on the right).

What’s Next: Integrating WhisperX

We have now deployed WhisperX’s large-v3 model to the library’s internal server. It’s actively being used to generate captions for incoming audio and video materials. This allows:

  • A significant reduction in manual labor for our DPS team
  • The potential for faster turnaround on caption requests
  • A scalable solution for future projects involving large media archives

Conclusion

As libraries continue to manage growing volumes of audio and video content, scalable and accurate captioning has become essential, not only for accessibility, but also for discoverability and long-term usability. Through this project, we identified WhisperX as a practical open-source solution that significantly improves transcription speed, speaker diarization, and timestamp precision. While no tool is perfect, WhisperX offers a strong foundation for building more efficient and inclusive media workflows in the library setting.

Reflections and Acknowledgements

This project helped me understand just how much thought and precision goes into building effective captioning systems. Tools like WhisperX offer powerful capabilities, but they still require careful evaluation, thoughtful tuning, and human oversight. I am incredibly grateful to have contributed to a project that could drastically reduce the time and effort required to caption large volumes of media, thereby enabling broader access and creating long-term impact across the library’s AV collections.

Finally, I would like to thank the Digital Production Services team for the opportunity and for their guidance and support throughout this project — especially Sarah Sweeney, Kimberly Kennedy, Drew Facklam, and Rob Chavez, whose insights and feedback were invaluable.

Issue and Inquiry and Urban Confrontation: Two Radio Programs Covering Urban Issues in Uncertain Times

Two radio program collections available in the Digital Repository Service (DRS) — Issue and Inquiry and Urban Confrontation — document social progress and unrelenting difficulties within American cities in 1970-71. Airing on Northeastern University’s radio station WRBB, the programs were produced by the university’s now-defunct Division of Instructional Communication. (Urban Confrontation noted that it ended in 1971 for financial reasons.)

Black and white image of two students sitting in a recording studio. They are wearing headphones and sitting at a table while surrounded by 1960s-era recording equipment
Students working in the WRBB (then WNEU) radio station in 1969. Photo courtesy of Northeastern University Archives and Special Collections.

Episodes were primarily hosted by Joseph R. Baylor and feature interviewees from across the United States discussing wide-ranging topics. From the threat of nuclear warfare to the farm labor rights movement, from the “longhair” youth subculture to de-facto school segregation, these episodes present a sweeping view of both common anxieties and optimistic ideas about the future of city life.

As a metadata assistant in Digital Production Services, I performed a survey of the episodes and their associated metadata records. This helped me understand how descriptive information should appear in the DRS. For example, I investigated how titles, creators, subjects, and abstracts should be recorded for each episode. Next, I created an editing plan, performed batch edits, and carefully listened to each episode. As I listened, I recorded accurate information about the episodes so it could be updated in the DRS.

I selected two interesting episodes to highlight here, but be sure to check out the full collection for more episodes.

Oil in Santa Barbara: The Pollution Tragedy (Issue and Inquiry, Episode 10)

In this episode from 1970, Al Weingand, Bob Solan, and Dick Smith discuss a Union Oil offshore drilling well explosion that occurred on January 28, 1969, expelling two million gallons of uncontrolled oil into the Santa Barbara Channel off the coast of California. Topics include the oil’s effect on tourism, the local economy, wildlife, and fishing, as well as environmental safety concerns.

Weingand, a Santa Barbara resident and former member of the California legislature, explains that no other disaster can compare to the devastation of the oil pollution. Smith, a reporter for the Santa Barbara News Press, calls for greater investment in the tourist value of beaches, saying that offshore oil well spills are dangerous both environmentally and economically. Solan, another reporter for the News Press, covers the psychological benefits of beautiful surroundings for Santa Barbara residents.

This episode was produced in a time of evolving standards for environmental safety and presents an intimate view of lives affected by oil pollution.

Afro-American Culture: The Black Artist Unchained (Urban Confrontation, Episode 11)

“The business that I am about is resurrecting that dormant conscious pride that Black people have had and should have.” — Elma Lewis (4:57)

In this episode, which aired in 1970, arts educator and activist Elma Lewis discusses the intertwined histories of Black labor and Black cultural impact in America. She speaks critically of modern art, which she says lacks a basis in life experience. This, Lewis explains, is why Black contributions to American culture transcend art and extend to labor and life experience, which have formed the basis of American society. Throughout the program, Baylor asks Lewis to respond to common racist comments about Black culture. Despite Baylor’s insistence that Lewis speak to his white audience, she pointedly declines the request. Laughing, she replies, “I don’t answer nonsense. I’m not in the business of answering nonsense.”

For more information on Elma Lewis, explore the DRS. More materials from the Elma Lewis collections (Elma Ina Lewis papers, Elma Lewis School of Fine Arts records) are expected to be available in the DRS in 2026.

I wanted to highlight these two episodes because they made me think deeply about both everyday problems and large socio-political injustices which continue to affect us today. “Oil in Santa Barbara” presents opinions from concerned community members in California. It focuses on their reaction to environmental pollution, showing common anxieties about business success, health, and the beauty of their local natural environment. By contrast, “Afro-American Culture” features distinguished Black arts educator Elma Lewis. She discusses fine arts movements, while also celebrating Black joy and artistry in the face of wide-scale systemic racism.

I greatly enjoyed the opportunity to help make these shows available in the DRS. Both Issue and Inquiry and Urban Confrontation hold potential research value for those interested in viewing snapshots of American life in the early 1970s.

Chelsea McNeil served as a part-time metadata assistant in Digital Production Services.

Scan It Right: Starting Your Own Digitization Project

Whether you are digitizing old family photos or creating a paperless record-keeping system, reformatting analog materials can be a lot of work! Here are some suggestions for what to think about when starting a project.

Documents, Photographs, Flat Art, Slides, and Negatives

Two unidentified women and one man standing in front of a computer.
Computer training course sponsored by New England Telephone and Telegraph Company. https://repository.library.northeastern.edu/files/neu:126895

Choosing a Scanner
Standard paper documents, such as reports, can go through a sheetfed scanner.

Photographs, artwork, or material on old or delicate paper should go on a flatbed scanner.

Slides and negatives can go on specialized scanners or on multipurpose scanners.

While all of these material types can be scanned in a home or office, if you are dealing with many items, it can be more efficient to send them to a vendor.

File Type
TIFF is one of the standard file types for scanned images and an excellent choice for saving high-quality images long-term. If you would like to read more about standards for digitization, check out the FADGI Technical Guidelines for Digitizing Cultural Heritage Materials. If you need smaller files, you can use Photoshop to save TIFF images as JPEG files. The Northeastern community has free access to the Adobe Creative Cloud, which includes Photoshop and Acrobat.

PDF is a good file type for documents. Some scanners will let you save automatically to PDF. You could also save the document pages as TIFF files, then use Adobe Acrobat to combine the files into a PDF.

File Naming
Give your files unique and descriptive names and avoid spaces in the names — use underscores, dashes, or camel-case instead. Think about how the file names will sort in Finder or Windows Explorer. Some examples:

  • Faculty_Report_1970_01.pdf
  • ChemBuilding001.tif, ChemBuilding002.tif, etc.

Paul Mahan from the Boys’ Clubs of Boston using an enlarger at a photographic laboratory. https://repository.library.northeastern.edu/files/neu:212609

Resolution
Resolution is how many pixels the scanner captures per inch of the original material. This is usually expressed in ppi (pixels per inch) or dpi (dots per inch). A higher dpi will capture more detail but will result in a larger file size.
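
For a rough sense of scale: an 8 x 10 inch photograph scanned at 400 dpi is 3,200 x 4,000 pixels (12.8 megapixels), which works out to roughly 38 MB as an uncompressed 24-bit color TIFF; the same photo at 300 dpi is about 22 MB.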

Based on the FADGI guidelines mentioned above, for text-based materials like journal articles or reports, 300 dpi is sufficient for most uses. For photographs and more image-heavy material, use 400 dpi. For slides or negatives, use around 3000 dpi.

Black and White, Grayscale, or Color
You can base this on the material you are scanning. If the material is entirely black and white or grayscale, then you can scan in black and white or grayscale. If the item has color that you want to capture, then scan in color.

Brightness, Contrast, and Cropping
Most scanners will allow you to adjust brightness and contrast settings. If you are scanning documents, adjust until the text appears solid: not choppy or blown out from too much brightness, but also not so dark that the letters fill in. For images, adjust until the brightness and contrast look true to the original.

Text Searchability
If you are creating a PDF, in most cases it should be text searchable for accessibility. To do this, you need to run OCR (Optical Character Recognition) on the document. For members of the Northeastern community, this is available in Adobe Acrobat.

Audiovisual Material

If you want to reformat A/V material (like VHS and audiocassette tapes) yourself, the following webinars from Community Archiving Workshop provide some guidance on the type of equipment to purchase.

However, it is often easiest to work with a vendor for A/V transfers. These materials can suffer from degradation that makes them challenging to capture. The Association of Moving Image Archivists has a directory of vendors.

In addition, the following guides from the National Archives and Records Administration can help you identify formats in your possession before you talk with a vendor. The first focuses on audio formats, like cassette tapes, and the second focuses on video formats, like VHS tapes.

Storage

For the files you create, make sure you save multiple copies in different geographic locations. For example, you might save one copy on your computer; another in a cloud-based location, like Backblaze or Google Drive; and then the final copy on an external hard drive.

You can also share files with friends and family through shared folders on Google Drive. For A/V materials, you can post unlisted videos on YouTube, so folks can only view them if they have a link.

Have any questions? Feel free to contact the librarians in Digital Production Services at Library-DPS@northeastern.edu.