Data Management

A brief overview of machine learning practices for digital collections

Northeastern University Library’s procedure for digitizing physical materials uses a few different workflows for processing print documents, photographs, and analog audio and video recordings. Each step in the digitization workflow, from collection review to scanning to metadata description, is performed with thorough attention to detail, and it can take years to completely process a collection. For example, processing the approximately 1.6 million photographs in The Boston Globe Library collection held by the Northeastern University Archives and Special Collections may take several decades!

What if some of these steps could be improved by using artificial intelligence technologies to complete portions of the work, freeing staff to focus more effort on the workflow elements that require human attention? Read on for a very brief overview of artificial intelligence and three potential options for processing The Boston Globe Library collection and other digital collections held by the Library.

A three-part cycle, with "Input" leading to "Model Learns and Predicts" leading to "Response" leading back to "Input"

What is artificial intelligence and machine learning?
Artificial intelligence (AI) is a broad term for many different technologies that attempt to emulate human reasoning in some way. Machine learning (ML) is a subset of AI in which a program is taught how to learn and reason on its own. The program learns by using an algorithm to process existing data and find patterns. Each prediction the model makes is evaluated and scored for accuracy, and the model is adjusted until its predictions reach an acceptable level of accuracy.

ML may be supervised or unsupervised, depending on the type of result needed. In supervised learning, labeled examples are provided to help the algorithm learn to identify patterns the researcher expects. In unsupervised learning, the algorithm is fed unlabeled data and discovers its own patterns, which may be unknown to the researcher.
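
As a concrete illustration, here is a minimal sketch of both approaches using scikit-learn and its bundled iris dataset. The dataset, models, and parameters are illustrative stand-ins, not tools from our actual workflows:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: labeled examples teach the model the patterns we expect,
# and predictions are scored against held-out labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("supervised accuracy:", clf.score(X_test, y_test))

# Unsupervised: no labels -- the algorithm discovers its own groupings.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("discovered cluster sizes:", [int((clusters == c).sum()) for c in range(3)])
```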

Ethics
As we undertake this work, it is important to be aware that AI technologies are human-made and therefore human biases are embedded directly within the technology itself. Because AI technologies can be employed at such a large scale, the potential for negative impact caused by these biases is greater than with tools that require standard human effort. Although it is tempting to adopt and employ a useful technology as quickly as possible, this is an area of research where it is imperative that we make sure the work aligns with our institutional ethics and privacy practices before it is implemented.

What AI or ML techniques could be used to help process digital collections?
OCR: The most widely known and used form of AI in digital collections practices may be recognition of printed text using Optical Character Recognition, or OCR. OCR analyzes images of printed text and extracts the text itself: letters, words, and sentences. The results may be embedded directly in the file (like a PDF with OCR’d text), stored separately (like in a METS-ALTO file), or both.
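
For a sense of what this looks like in code, here is a minimal sketch using the open-source Tesseract engine via the pytesseract wrapper. The input file name is hypothetical, and this is not necessarily the engine used in the Library’s workflow:

```python
from PIL import Image
import pytesseract

# Extract plain text from a scanned page (hypothetical file name).
image = Image.open("scanned_page.png")
text = pytesseract.image_to_string(image)
print(text[:200])

# Tesseract can also emit ALTO XML, the layout-aware format mentioned above.
alto_xml = pytesseract.image_to_alto_xml(image)
print(len(alto_xml), "bytes of ALTO XML")
```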

Screenshot of the front page of the Winchester News
Image source: Screenshot of an OCR page of The Winchester News with METS-ALTO encoding opened in AltoViewer.

OCR works rather well for modern text documents, especially those in English, but historical documents pose a particular challenge. For more about this challenge, I recommend A Research Agenda for Historical and Multilingual OCR, a fairly recent report published by NULab.

A screenshot of a search result that reveals the result was returned because the search term matched OCR'd text within the document.

We can already see the benefit of using OCR in the library’s Digital Repository Service: files with embedded OCR text have their full text extracted and stored alongside the file. That text is indexed, which improves the discoverability of text files, since a search can retrieve files whose terms match either the file’s metadata or its full text.
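
As a toy illustration of why this helps (not the DRS’s actual implementation), the sketch below extracts a PDF’s embedded text layer with pdfminer.six and builds a tiny inverted index that a search could consult; the file name and search term are hypothetical:

```python
from collections import defaultdict
from pdfminer.high_level import extract_text

index = defaultdict(set)  # word -> set of files containing it

def add_to_index(pdf_path: str) -> None:
    text = extract_text(pdf_path)      # pulls the embedded OCR text layer
    for word in text.lower().split():
        index[word].add(pdf_path)

add_to_index("ocr_document.pdf")       # hypothetical repository file
print(index.get("boston", set()))      # files whose full text matches
```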


The back of a photograph from the Boston Globe Library Collection, featuring difficult-to-read handwritten descriptions.
Digitized back of a photograph from The Boston Globe Library collection.

HTR: Handwritten Text Recognition, or HTR, is like OCR, but for handwritten rather than typewritten text. Handwriting is highly individual and poses a difficult challenge for teaching machines to interpret it. HTR relies heavily on having lots of data to train a model (in this case, lots of digitized images of handwriting), so even once a model is accurately trained on one set of handwriting, it may not be useful for accurately interpreting another. Transkribus is a project attempting to navigate this challenge by creating training sets for batches of handwriting data: researchers submit at least 100 transcribed images for a particular handwriting set, and Transkribus uses that set as training data to create an HTR model to process the remaining corpus of handwritten text. HTR is appealing for the Boston Globe collection, as the backs of the photographs contain handwritten text describing the image, including the photographer’s name, the date the photograph was taken, classification information, and perhaps a description or an address.
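
Transkribus is one platform for this work; as a code-level illustration of HTR in general, here is a minimal sketch using the open-source TrOCR handwriting model from Hugging Face. The input image is hypothetical, and the model is an assumed stand-in rather than anything we have adopted:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load a model pretrained on handwritten English text.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# Hypothetical scan of the back of a photograph.
image = Image.open("photo_back.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate a transcription of the handwriting.
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```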

Computer Vision: Computer vision refers to AI technologies that allow machines to work with images and video, essentially training a machine to “see”. This type of AI is particularly challenging because it requires the machine to learn how to observe a picture, analyze it, and understand its content. Computer vision algorithms are trained to identify patterns corresponding to different objects or people and to sort and identify those patterns accurately. In a picture of the Northeastern campus, for example, a computer vision algorithm may be able to identify buildings, people, or trees as distinct objects.

A black and white photograph of a man being arrested by two police officers next to an analysis of the photo's contents: Footwear (98%); Shoe (96%); Gesture (85%); Style (84%); Military Person (84%); Black-and-white (84%); Military Uniform (80%); Cap (80%); Hat (78%); Street Fashion (75%); Overcoat (75%)
Result of Google Cloud’s Vision API analysis for a black and white photograph.

When used in digital collections workflows, the output produced by computer vision tools will need to be evaluated for its usefulness and accuracy. In the above example, the terms returned to describe the image are technically present in the photo (the subjects are wearing shoes and hats and overcoats), but the terms do not adequately capture the spirit of the image (a person being detained at a demonstration).
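
For reference, here is a minimal sketch of the kind of call that produced the labels above, using Google Cloud’s Vision API Python client (credentials setup omitted; the file name is hypothetical):

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Hypothetical digitized photograph from the collection.
with open("photo.jpg", "rb") as f:
    image = vision.Image(content=f.read())

# Request label detection; each label carries a confidence score,
# which is exactly what needs human evaluation for usefulness.
response = client.label_detection(image=image)
for label in response.label_annotations:
    print(f"{label.description} ({label.score:.0%})")
```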

There are many ethical concerns about using computer vision, especially for recognizing faces and assigning emotions. If we were to employ this particular technology, it might be able to generate keywords or other descriptive metadata for the Boston Globe collection that are not present on the back of an image, but we would need to be careful to ensure that the process does not embed problematic assessments into the description, like describing an image of a protest as a riot.

Computer vision is already being employed in some digital collection workflows. Carnegie Mellon University Libraries has developed an internal tool called CAMPI to help archivists enhance metadata. An archivist uses the software to tag selected images, and the program then returns other images it identifies as visually similar, regardless of box and folder, allowing the archivist to apply the same tags to those images without having to seek them out manually.
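
CAMPI’s internals aren’t described here, but the underlying idea of visual similarity can be sketched generically: embed each image with a pretrained network and compare the embeddings. The model choice and file names below are assumptions for illustration, not the actual CAMPI pipeline:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained ResNet-50 with the classifier removed, leaving 2048-d features.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Return a unit-length feature vector for one image."""
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return torch.nn.functional.normalize(model(x), dim=1)

# Cosine similarity between a tagged image and a candidate (hypothetical files);
# values near 1.0 suggest the candidate should receive the same tags.
score = (embed("tagged.jpg") @ embed("candidate.jpg").T).item()
print(f"similarity: {score:.3f}")
```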

Many other aspects of AI and ML technologies will need to be researched and evaluated before they can be integrated into our digital collections workflows. We will need to assess tools and identify the training staff will need to perform the work. We will also continue to watch leaders in this space as they dive deep into the world of artificial intelligence for library work.

Recommended resources:
Machine Learning + Libraries: A Report on the State of the Field / Ryan Cordell: https://blogs.loc.gov/thesignal/2020/07/machine-learning-libraries-a-report-on-the-state-of-the-field/
Digital Libraries, Intelligent Data Analytics, and Augmented Description / University of Nebraska–Lincoln: https://digitalcommons.unl.edu/libraryscience/396/

Learn to Write a Data Management Plan, Find Out What Social Media Knows About You, and More

"You Are Here" artwork by Mario Klingemann

How does your commute make you feel? Map it! What does Facebook know about you? Download your data! What do you need to say about your data in a grant proposal? Learn about data management plans!

We’re hosting a few events this month to coincide with Love Data Week and Endangered Data Week, and you’re invited!

Check out the full lineup and register for your spot: bit.ly/snelldata19

“You Are Here” by Mario Klingemann on Flickr, CC BY 2.0

BYO data & code! Preparing for reproducible publication workshop

Data code workshop flyer NU

We’re bringing Code Ocean to campus on Nov. 8th for a hands-on, interactive workshop.

This 2-hour session is a unique opportunity to bring order to your own data and/or code! You’ll receive expert, step-by-step guidance on:
  • Organizing your files
  • Creating a codebook (so that others – not to mention your ‘future self’ – can understand how & why you’ve done things)
  • Preparing your code & data for documentation and reuse
  • Maximizing the potential reproducibility of your research outputs
Space is limited to about 20 attendees, so please register soon to reserve your place. More info and the registration link are here. Questions? Contact Jen Ferguson or Tom Hohenstein.

Data Fest is coming in February

Since Love Data Week and Endangered Data Week both happen in February, we thought we’d use this month to showcase some of the great data-related services and resources we have to offer here at Snell. We’re calling it Data Fest, and you’re invited! Here’s a taste of what we have planned:
  • Stop by and lend a hand at our Citizen Science: Health Hackathon
  • Make friends with your command line at our Intro to the Unix Shell workshop
  • Learn how to create impressive charts & data visualizations at our workshops on Tableau and free web-based tools
And more! Check out the full lineup and register here: http://bit.ly/snelldatafest18

Meet the 2017 CERES Exhibit Toolkit Projects!

The DSG is proud to announce the projects chosen for this year’s round of CERES Exhibit Toolkit development. We will work with the following four projects to implement enhancements and new features to improve user experience, create additional exhibit tools, and incorporate the Toolkit in the classroom:  

Boston as Middle Passage

In 2015, students and researchers working with the National Parks Service built a website to preserve research documenting Boston as one of many transatlantic slave trade Middle Passage sites. Sadly, in less than two years the site has become unusable due to server issues and lapsed hosting. This year we will work with the creators of the site to transfer the rescued research materials to the DRS and recreate the original exhibits in the Early Black Boston Digital Almanac (a 2016 Toolkit project still in development).  

Dragon Prayer Book

The Dragon Prayer Book project is a research endeavor led by Erika Boeckeler, faculty in the Department of English, to study the Dominican Prayer Book, a fifteenth-century manuscript held by Archives and Special Collections. The Dragon Prayer Book project was accepted as a Toolkit project in 2016, and this year we will work with the project team to enhance the Toolkit’s IIIF high-resolution image viewer: http://dragonprayerbook.northeastern.edu/mirador/

Freedom House

As part of their ongoing effort to highlight archival collections using online exhibits, last year Archives and Special Collections used the Toolkit to create a set of exhibits for the Freedom House photograph collection: http://freedomhouse.library.northeastern.edu/. This year, Archives proposed a new browse feature that would allow them to build dynamic exhibits bringing together all Freedom House materials that match a particular subject term, like “Kennedy, John F.”. This enhancement will allow Archives and other Toolkit site builders to create dynamic exhibits that automatically populate with DRS materials matching particular subjects, creators, or other faceted metadata values.
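
The DRS search infrastructure isn’t detailed here, but the idea behind such a dynamic exhibit can be sketched as a faceted query against a search index. The sketch below uses a hypothetical Solr-style endpoint; the URL and field names are invented placeholders, not the DRS’s actual API:

```python
import requests

# Hypothetical Solr-style search endpoint and field names -- placeholders
# for illustration only.
params = {
    "q": "*:*",
    "fq": 'subject_facet:"Kennedy, John F."',  # filter by a subject facet
    "rows": 50,
    "wt": "json",
}
response = requests.get("https://example.edu/solr/select", params=params)

# An exhibit page could render every matching item, updating automatically
# as new materials with that subject are added to the index.
for doc in response.json()["response"]["docs"]:
    print(doc.get("title"))
```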

Literature and Digital Diversity

This fall, Elizabeth Dillon and Sarah Connell will be co-teaching Literature and Digital Diversity, an undergraduate course focusing on “the use of digital methods to analyze and archive literary texts, with particular attention to issues of diversity and inclusion”. Students in the class will use the Toolkit to explore “how computers, databases, and analytical tools give substance to concepts of aesthetic, cultural, and intellectual value as inflected by race and gender.” This project will be the first to use the CERES classroom teaching materials originally developed for Nicole Aljoe’s award-winning Writing Black Boston class, which used the Toolkit to create the Early Black Boston Digital Almanac (still in development). To increase the breadth of materials available to the class (and other site builders), we will also consider adding Europeana as an additional data source for Toolkit materials (similar to the DPLA connection built in 2016).

We also continue work with our partners on the 2015 and 2016 projects. For more information about these projects, visit the DSG website (about the projects, about CERES) or contact us.