CHORUS Data Journeys

Ted Habermann, Metadata Game Changers

Jamaica Jones, University of Pittsburgh

Howard Ratner and Tara Packer, CHORUS

Introduction

Cite this blog as Habermann, T. (2023). CHORUS Data Journeys. Front Matter. https://doi.org/10.59350/ksgzn-a6w37

A recent blog post described a new partnership between Metadata Game Changers and CHORUS aimed at understanding how CHORUS reports can help federal agencies, other funders, and other users access information from the global research infrastructure and use it to understand impact and connections between people, organizations, and research objects.

The CHORUS Dashboard provides visualizations and a variety of reports for Federal agencies and other users (Figure 1). This project focuses on three of the CHORUS Reports: All, Author Affiliation, and Dataset. The goal of the first phase of the project is to understand the contents of the CHORUS reports and the data collection and processing that bring the data to the reports, i.e., the CHORUS Data Journey.

Figure 1. CHORUS data are available in reports that provide analysis-ready data for answering many questions about federally funded research. We will focus on the All, Author Affiliation, and Dataset Reports.

One of our goals is to help increase the visibility of these data by looking more deeply into them while exposing them to users, a process termed informating. In this blog post we focus on the CHORUS Data Journeys.

Three Reports – Three Journeys

The CHORUS reports we are focusing on provide an overview of metadata for journal articles in Crossref (the All Report), authors (the Author Affiliation Report), and datasets in DataCite (the Dataset Report). Altogether, these three reports include 83 metadata elements that represent 51 documentation concepts.

Figure 2 shows the three reports considered, the documentation concepts they include, and the sources of the metadata. Some elements, near the center of the figure, are shared by and connect all three reports, and some, in the upper right, are shared by the All Report and the Author Affiliation Report.

Figure 2. CHORUS reports considered here and documentation concepts they include.

CHORUS retrieves metadata from multiple sources, indicated by the colors in Figure 2, and combines them in these reports. Understanding the journeys the data take through the global research infrastructure is critical to understanding how the data can be used (Figure 3).

The CHORUS Data Journeys start with funding agencies (with Funder Identifiers) supporting researchers (with ORCIDs) who write journal articles and create datasets. The articles are published in scientific journals that send metadata to Crossref and register a digital object identifier (DOI) for each article. Some researchers also create or collect data and register those data with DataCite to receive a dataset DOI. These identifiers and the metadata associated with them are the lifeblood of the infrastructure and of CHORUS.

Figure 3. The CHORUS Data Journey.

CHORUS regularly queries the Crossref API using the Funder Name and the Funder Identifier (Figure 3A) and retrieves metadata (Figure 3B) for journal articles that acknowledge the funder for supporting their work. This Crossref metadata forms the basis for all CHORUS reports (Figure 3C), and the Crossref DOIs are the keys to more detailed metadata for authors and datasets.
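To make this step concrete, here is a minimal sketch of such a funder query against the public Crossref REST API (see the Crossref API documentation in the references). It is an illustration rather than CHORUS's production workflow: the NSF identifier from the Crossref Funder Registry (100000001) is used as an example, and paging, retries, and polite-pool headers are omitted.

```python
"""Sketch: retrieve article metadata for a funder from the Crossref REST API.

Illustration only, not CHORUS's production code. The funder identifier
100000001 (NSF in the Crossref Funder Registry) is used as an example.
"""
import requests

CROSSREF_WORKS = "https://api.crossref.org/works"


def works_for_funder(funder_id: str, rows: int = 5) -> list[dict]:
    """Return journal-article metadata for works that acknowledge a funder."""
    params = {
        "filter": f"funder:{funder_id},type:journal-article",
        "rows": rows,  # only the first page; a real harvest would page with cursors
    }
    response = requests.get(CROSSREF_WORKS, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["message"]["items"]


if __name__ == "__main__":
    for work in works_for_funder("100000001"):
        print(work["DOI"], (work.get("title") or [""])[0])
```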

CHORUS then queries ScholeXplorer, a collection of over 300 million links, with the journal DOIs (Figure 3D) to find datasets linked to the articles and the dataset DOIs for those datasets. Those DOIs (purple in Figure 3) go into the All Report and are used to query the DataCite API to retrieve dataset metadata. That metadata (Figure 2, Figure 3E) goes into the CHORUS Dataset Report along with Crossref metadata (Figure 2). Note that most of the metadata in the Dataset Report (Figure 2) comes from DataCite.
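The sketch below illustrates this article-to-dataset leg of the journey. It assumes the public ScholeXplorer Links endpoint and the DataCite REST API listed in the references; the Scholix field names used when parsing the response are assumptions that vary between schema versions, so treat this as a sketch of the idea rather than CHORUS's actual code.

```python
"""Sketch: follow an article DOI to linked dataset DOIs and DataCite metadata.

Illustration only. The ScholeXplorer endpoint and the Scholix field names
("result", "target", "Identifier", "IDScheme") are assumptions and may differ
between API and schema versions; the DataCite endpoint is the public REST API.
"""
import requests

SCHOLIX_LINKS = "https://api.scholexplorer.openaire.eu/v2/Links"  # assumed endpoint
DATACITE_DOIS = "https://api.datacite.org/dois"


def dataset_dois_for_article(article_doi: str) -> set[str]:
    """Find DOIs of objects that ScholeXplorer links to a journal article."""
    response = requests.get(SCHOLIX_LINKS, params={"sourcePid": article_doi}, timeout=30)
    response.raise_for_status()
    dois = set()
    for link in response.json().get("result", []):
        # In practice one would also filter on the target object type ("dataset");
        # that field's exact name varies between Scholix versions, so it is omitted.
        target = link.get("target") or link.get("Target") or {}
        for identifier in target.get("Identifier") or target.get("identifier") or []:
            if str(identifier.get("IDScheme", "")).lower() == "doi":
                dois.add(identifier["ID"])
    return dois


def datacite_metadata(dataset_doi: str) -> dict:
    """Retrieve dataset metadata (titles, publisher, year, ...) from DataCite."""
    response = requests.get(f"{DATACITE_DOIS}/{dataset_doi}", timeout=30)
    response.raise_for_status()
    return response.json()["data"]["attributes"]
```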

Understanding this journey is important because it includes several obstacles that must be overcome for datasets to be included in the CHORUS Dataset Report:

  1. The researchers must provide funder metadata (name and identifier) that the journal publisher includes in the Crossref metadata,

  2. The dataset created or used must be in a repository and must have a DataCite DOI,

  3. The link between the article and the dataset DOI must be included in ScholeXplorer.

The journey to the Author Affiliation Report starts with a query to the ORCID API using the article DOI from Crossref (Figure 3F). This query returns ORCIDs that are associated with the DOI. These are then included in the CHORUS All and Author Affiliation Reports (Figure 3), along with metadata from the original Crossref query.
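A minimal sketch of this step is shown below, assuming the ORCID public search API and its doi-self search field; again, this illustrates the idea rather than reproducing CHORUS's implementation.

```python
"""Sketch: find ORCID iDs associated with an article DOI.

Illustration only; assumes the ORCID public search API and its
'doi-self' search field.
"""
import requests

ORCID_SEARCH = "https://pub.orcid.org/v3.0/search"


def orcids_for_article(article_doi: str) -> list[str]:
    """Return ORCID iDs whose public records claim this article DOI."""
    response = requests.get(
        ORCID_SEARCH,
        params={"q": f'doi-self:"{article_doi}"'},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    results = response.json().get("result") or []
    return [r["orcid-identifier"]["path"] for r in results]
```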

The metadata in the Author Affiliation Report face obstacles similar to those mentioned above:

  1. The researchers must have ORCIDs,

  2. The metadata associated with the ORCIDs must be publicly available,

  3. The ORCIDs must be associated with the articles by the journal or the author.

Conclusion

CHORUS acts as a simplifying lens for the vast collection of data and metadata termed the Global Research Infrastructure. As described here, CHORUS retrieves metadata from multiple sources in that infrastructure and makes it available as visualizations and reports that are compatible with many analytical tools. 

The journeys taken by those metadata rely on help from many contributors along the way: funders with identifiers support researchers with identifiers who do research and submit data and results to repositories and journals. Publishers collect article metadata and provide it to Crossref; repositories collect dataset metadata and provide it to DataCite; Crossref, DataCite, and ORCID provide persistent identifiers; and ScholeXplorer collects the links between them. All of this information is openly available to CHORUS, which pulls it together and shares it in analysis-ready formats.

This system works very well, providing over 500,000 connected resources for just the three agencies included in this study (NSF, USGS, and USAID) and many more overall. At the same time, systems with many moving parts invariably introduce friction (Edwards et al., 2011) that must be overcome to realize the system's full potential. While CHORUS minimizes friction by organizing and simplifying data formats, much of the system-wide friction can be minimized with identifiers and complete metadata for everything, and with open archives that build on those identifiers to make connections.

We started by understanding the data journeys and examining the repositories that hold datasets connected to papers funded by these three agencies. In future blog posts we will examine other aspects of the CHORUS reports, including when papers and data are published, metadata completeness, and connectivity, in order to characterize the whole picture.

References

Crossref, Crossref API Documentation, https://www.crossref.org/documentation/retrieve-metadata/rest-api/

DataCite, DataCite REST API Guide, https://support.datacite.org/docs/api

Edwards, P. N., Mayernik, M. S., Batcheller, A. L., Bowker, G. C., & Borgman, C. L. (2011). Science friction: Data, metadata, and collaboration. Social Studies of Science, 41(5), 667–690. https://doi.org/10.1177/0306312711413314

Habermann, T. (2023). INFORMATE: Metadata Game Changers and CHORUS Collaborate to Make the Invisible Visible. https://doi.org/10.59350/yqkat-59f79

ORCID, Public API, https://info.orcid.org/documentation/features/public-api/

Wikipedia contributors. (2023, March 17). Informating. In Wikipedia, The Free Encyclopedia. Retrieved 00:58, March 12, 2024, from https://en.wikipedia.org/w/index.php?title=Informating&oldid=1145106790