Data Journeys Through the Global Research Infrastructure

Ted Habermann (0000-0003-3585-6733), Metadata Game Changers

Jamaica Jones (0000-0002-1969-2508), University of Pittsburgh

Cite this blog as Habermann, T. and Jones, J. (2025). Data Journeys Through the Global Research Infrastructure. Front Matter. https://doi.org/10.59350/1xe6n-a1x59.

The Global Research Infrastructure (GRI) is made up of the repositories and organizations that provide persistent identifiers (PIDs) and metadata for many kinds of research objects and that use PIDs to connect these objects to funders, research institutions, researchers, and one another. This infrastructure is large and growing quickly:

  • Crossref now holds metadata for over 110 million journal article DOIs and over 50 million other kinds of research objects.  

  • DataCite now holds metadata for nearly 30 million datasets and over 45 million other types of resources with over 70 million relations and 43 million contributors.

In addition, many other organizations provide overviews that combine these sources with others.

Any interconnected system this complex and dynamic offers many approaches to answering any question and a plethora of correct answers that depend on combinations of the organizations and people that collect and publish data, and on sources, queries, tools, and timing. These combinations reflect different journeys between and through places where people are engaged in practices of data production, processing, distribution, and use, termed Data Journeys by Bates et al. (2016).

The INFORMATE Project has combined three data sources to explore how the global research infrastructure might help the US National Science Foundation (NSF) and other federal agencies identify and characterize the impact of their support. The data sources are the NSF Award Database, the NSF Public Access Repository (PAR), and the global research infrastructure as viewed through CHORUS. The Award Database, created and managed by NSF, has the most straightforward data journey and is assumed to be complete. PAR and CHORUS reflect more complex data journeys and therefore different views and coverage of work funded by NSF or other agencies.

Some results of that work were presented at the NISO Plus Global meeting during September 2024 and the American Geophysical Union Meeting during December 2024. Those talks have been combined with some other recent results to provide an overview of a variety of data journeys.

The results for NSF demonstrate that a significant number of awards and research results that are not included in the NSF Public Access Repository can be discovered in the global research infrastructure (Figure 1; Habermann and Jones, 2024).

Figure 1. A sample of 51,602 awards with 308,549 associated articles (DOIs) was selected from CHORUS and searched for in PAR. We found that 32,543 (63%) of the awards and 127,218 (41%) of the DOIs were in PAR.
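The percentages in the Figure 1 caption follow directly from the reported counts. The short sketch below reproduces that arithmetic; the function and variable names are illustrative assumptions rather than part of the INFORMATE tooling, and only the four counts come from the caption.

```python
def coverage(found: int, total: int) -> float:
    """Return the share of `total` items that were found, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * found / total


# Counts reported in the Figure 1 caption (sample drawn from CHORUS).
awards_in_sample = 51_602
awards_in_par = 32_543
dois_in_sample = 308_549
dois_in_par = 127_218

award_pct = coverage(awards_in_par, awards_in_sample)
doi_pct = coverage(dois_in_par, dois_in_sample)

# Sampled items that were not found in PAR (hypothetical variable names).
awards_missing = awards_in_sample - awards_in_par
dois_missing = dois_in_sample - dois_in_par

print(f"Awards found in PAR: {award_pct:.0f}%")
print(f"DOIs found in PAR: {doi_pct:.0f}%")
print(f"Sampled awards not found in PAR: {awards_missing:,}")
print(f"Sampled DOIs not found in PAR: {dois_missing:,}")
```

The same calculation applies to any pair of sources, as long as the denominator is the sample drawn from the first source and the numerator is the subset also found in the second.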

The temporal history of these numbers (Figure 2) shows that the portion of published articles included in PAR increased significantly for awards with effective dates of 2017 and 2018 while the number of articles identified only in CHORUS decreased. It also shows an increase in the number of articles that are only in PAR. These may represent published articles without clear acknowledgements.

Figure 2. Time history of award references in CHORUS and PAR.

In addition to the NSF results, we compare DataCite Commons with Crossref and with ORCID profiles to help understand the data journeys into DataCite Commons. We find that Commons and Crossref agree on funder metadata in roughly half of the cases and that ORCID profiles include more works for researchers than Commons does.

Travel Tips

This work suggests several tips to help data and repository managers and other travelers through this infrastructure:

  • Researchers, Institutional Repositories, and Publishers: maximize benefits with complete, consistent, and well-connected metadata.

  • Repositories: choose tools that include update capabilities so new connections can be identified and added to metadata.

  • All: measure and recognize bright spots and share success. 

Resources

This talk is available as slides and a video.

References

Bates, J., Lin, Y.-W., and Goodale, P. (2016). Data journeys: Capturing the socio-material constitution of data objects and flows. Big Data & Society. https://doi.org/10.1177/2053951716654502

Habermann, T. and Jones, J. (2024). The Global Research Infrastructure and the NSF Public Access Repository. Front Matter. https://doi.org/10.59350/s2mgr-mwt45