Funder Acronyms Are Still Not Enough

Ted Habermann, Metadata Game Changers

Jamaica Jones, University of Pittsburgh

Howard Ratner, Tara Packer, CHORUS

Introduction

Cite this blog as Habermann, T. (2024). Funder Acronyms Are Still Not Enough. Front Matter. https://doi.org/10.59350/cnkm2-18f84

Funder metadata is becoming more important as the global research infrastructure is engaged as a tool for quantifying the impact of funders on research results and as interest in open science increases. The transition from Crossref Funder IDs to ROR IDs for funders (French et al., 2023; Lammey, 2024) is also focusing attention on the funder identifiers used in these metadata.

A recent workshop described benefits and challenges the community faces as it tries to increase the amount and quality of funder metadata (De Jonge et al., 2023). The authors observed: “Authors often persist in the wrong spelling of their funder and do not choose predefined suggestions, making it very difficult to match input to Funder IDs”. Using acronyms instead of complete funder names is another persistent problem, documented for NSF and NIH in the U.S. and likely affecting other organizations around the world (Habermann, 2021, 2022).

Working with funder identifiers as part of the INFORMATE project (Habermann et al., 2023) has brought the realities of these challenges to the fore. In this blog we highlight real-world examples that might be helpful in illustrating the problem.

Crossref Funder Metadata

The U.S. National Science Foundation (Crossref Funder ID: 100000001, ROR: https://ror.org/021nxhr62) emerges as a major player wherever one looks at funder metadata. A recent CHORUS (https://www.chorusaccess.org/) report identified 85,091 Crossref journal-article DOIs that include both the NSF funder ID and name in their funder metadata, with metadata for 970,803 awards. The acronym “NSF” occurred 792,452 times in the funder names associated with these awards and occurred by itself 167,767 times (Figure 1).

Figure 1. Counts of DOIs, Awards, and funder names in Crossref metadata retrieved by CHORUS for the U.S. National Science Foundation.

A ROR search for “NSF” illuminates the challenge of associating an identifier (a ROR ID in this case) with these “NSF” acronyms. Nine potential choices, all identified from the acronym, share the same score (0.9), so the algorithm chooses no best match. Even if the country is included in the search, three equally scored choices remain (Table 1); a sketch for reproducing this lookup appears after the table.

| Matching Type | Score | Chosen | ROR | Organization | Country |
|---------------|-------|--------|-----|--------------|---------|
| ACRONYM | 0.90 | False | 00se21e39 | Nick Simons Foundation | United States |
| ACRONYM | 0.90 | False | 00zc1hf95 | National Sleep Foundation | United States |
| ACRONYM | 0.90 | False | 010xaa060 | National Science Foundation of Sri Lanka | Sri Lanka |
| ACRONYM | 0.90 | False | 01822d048 | Norwegian Nurses Organisation | Norway |
| ACRONYM | 0.90 | False | 01fy9e204 | The Neurosciences Foundation | United Kingdom |
| ACRONYM | 0.90 | False | 021nxhr62 | National Science Foundation | United States |
| ACRONYM | 0.90 | False | 03t3x0x69 | Norsk Sosiologforening | Norway |
| ACRONYM | 0.90 | False | 05eg49r29 | Bulgarian Science Fund | Bulgaria |

Table 1. Organizations identified by ror.org as potential matches for the acronym "NSF". Note that all have the same score and none is chosen by the algorithm. These results were provided by RORRetriever (Habermann, 2022a).
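The candidates in Table 1 can be reproduced with a short query against the public ROR affiliation-matching API. The sketch below is a minimal illustration, assuming the v1 endpoint and its response fields (matching_type, score, chosen) as documented at the time of writing; the helper name is ours.

import requests

def ror_candidates(affiliation):
    """Query the ROR affiliation matcher and return its candidate list."""
    resp = requests.get(
        "https://api.ror.org/organizations",
        params={"affiliation": affiliation},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

# Each candidate carries the fields shown in Table 1.
for item in ror_candidates("NSF"):
    org = item["organization"]
    print(item["matching_type"], item["score"], item["chosen"],
          org["id"], org["name"], org["country"]["country_name"])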

Six of these organizations have Crossref Funder IDs, and Table 2 shows the number of times each occurs in the CHORUS data for NSF. Together these organizations account for over 2,200 cases. Of course, it is impossible to know whether these funder metadata are correct without knowing more about their sources.

| Organization | Crossref Funder ID | Count |
|--------------|--------------------|-------|
| Norwegian Nurses Organisation | 501100004190 | 31 |
| Bulgarian Science Fund | 501100003336 | 57 |
| National Sleep Foundation | 100003187 | 1233 |
| The Neurosciences Foundation | 501100020414 | 26 |
| Nick Simons Foundation | 100016620 | 3 |
| National Science Foundation of Sri Lanka | 501100008982 | 906 |

Table 2. Crossref Funder IDs for the organizations in Table 1 and the number of times each occurs in the CHORUS data for NSF.

The funder metadata in the CHORUS report is retrieved from Crossref (Habermann, 2023), which collects metadata from publishers and others. De Jonge et al. (2023) explored challenges in the metadata workflow and identified collecting funder metadata and extracting it from free text as a common source of funder metadata errors. This reflects the inherent complexity of funding sources and the diversity of ways those sources are described in free text. The following acknowledgement fragment illustrates typical challenges:

“S. E. O., L. D. T., S. T. G., and V. T. (at sea) were supported by the Southern Ocean Carbon and Climate Observations and Modeling Project under NSFPLR-1425989; V. T. and I. C. received significant support from NSFOCE-1357072. S. T. G. and I. C. were also supported by NSFOCE-1658001. S. A. J. is supported by the UK Natural Environment Research Council, including the ORCHESTRA grant (NE/N018095/1).”

First, funding differs by author, with authors represented by initials and funding statements spread throughout the text. Second, funder names appear as acronyms (“NSF”) in several cases and are spelled out in another. Finally, funder acronyms are concatenated with award numbers (NSFOCE-1658001). This diversity is not unusual, as acknowledgements have long been written for humans, not machines, to read.
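As a concrete illustration of that last point, a hypothetical extraction step might use a regular expression to pull apart concatenated acronym/award strings such as NSFOCE-1658001; the pattern below is an assumption for this example, not a description of any production pipeline.

import re

# Assumed pattern: "NSF", an optional directorate code such as "OCE" or
# "PLR", an optional separator, and a seven-digit award number.
NSF_AWARD = re.compile(r"\bNSF([A-Z]{2,4})?[- ]?(\d{7})\b")

acknowledgement = (
    "...were supported by the Southern Ocean Carbon and Climate "
    "Observations and Modeling Project under NSFPLR-1425989; V. T. and "
    "I. C. received significant support from NSFOCE-1357072."
)

for code, award in NSF_AWARD.findall(acknowledgement):
    print(code or "(none)", award)   # PLR 1425989, OCE 1357072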

Of course, these metadata pass through many steps in many systems between the researcher and the global research infrastructure. It could be that someone along the way mistakenly picked the wrong funder ID and/or funder name from a list in a manuscript tracking system. If authors were given the opportunity to check the funder ID and name when checking their article proofs, they might well spot and fix the error. However, this would require the publisher’s system to show the author information that was not part of the original manuscript submission.

DataCite Commons Funder Metadata

DataCite Commons makes it possible to search a different swath of the global research infrastructure for funders using Crossref Funder and ROR IDs. For example, the URL https://commons.datacite.org/ror.org/010xaa060 provides summary information on over 10,000 Text, Journal Article, Dataset, and Collection resources that include the ID for the National Science Foundation of Sri Lanka (https://ror.org/010xaa060), see Figure 2.

Most of these resources (7,616) are text resources and over 200 are datasets. The datasets are retrieved directly from DataCite, as opposed to being linked to articles retrieved from Crossref as in the CHORUS case. Similar data can be retrieved for the National Sleep Foundation.


Figure 2. DataCite Commons page (https://commons.datacite.org/ror.org/010xaa060) for National Science Foundation of Sri Lanka with information on connections across the global research infrastructure.

The related works from DataCite Commons can be retrieved using the DataCite GraphQL API (see Fenner, 2021 for guidance on pagination with this API). The query shown below retrieves DOI, type, registrationAgency, publisher, publicationYear, and funder metadata.

{
  organization(id: "https://ror.org/010xaa060") {
    id
    name
    works(first: 10) {
      totalCount
      pageInfo {
        endCursor
        hasNextPage
      }
      nodes {
        doi
        type
        registrationAgency
        publisher
        publicationYear
        fundingReferences {
          funderName
          funderIdentifier
          awardNumber
        }
      }
    }
  }
}
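Here is a minimal Python sketch for running this query against the DataCite GraphQL endpoint and following the cursor until every page is fetched (see Fenner, 2021); the page size, variable names, and helper function are assumptions for illustration.

import requests

DATACITE_GRAPHQL = "https://api.datacite.org/graphql"

QUERY = """
query ($ror: ID!, $cursor: String) {
  organization(id: $ror) {
    works(first: 100, after: $cursor) {
      totalCount
      pageInfo { endCursor hasNextPage }
      nodes {
        doi
        type
        registrationAgency
        publisher
        publicationYear
        fundingReferences { funderName funderIdentifier awardNumber }
      }
    }
  }
}
"""

def fetch_works(ror_id):
    """Page through all works connected to a ROR ID via cursor pagination."""
    cursor, works = None, []
    while True:
        resp = requests.post(
            DATACITE_GRAPHQL,
            json={"query": QUERY,
                  "variables": {"ror": ror_id, "cursor": cursor}},
            timeout=60,
        )
        resp.raise_for_status()
        page = resp.json()["data"]["organization"]["works"]
        works.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return works
        cursor = page["pageInfo"]["endCursor"]

works = fetch_works("https://ror.org/010xaa060")
print(len(works), "related works retrieved")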

The National Science Foundation of Sri Lanka data retrieved from DataCite Commons include DataCite funder metadata for 753 awards, and the funder identifier for the National Science Foundation of Sri Lanka (https://doi.org/10.13039/501100008982) is listed for 435 of these awards. In every one of those cases the funder name is “National Science Foundation”.

Combining award metadata from DataCite and Crossref, there are 11,573 occurrences of this funder ID associated with 136 different funder name strings, most of them variations on “National Science Foundation” and some identifying national science foundations in countries other than the United States and Sri Lanka. The most common funder names are shown in Table 3. Clearly most of these funder IDs are incorrect: only 744 of 11,573 (6%) of these funder names are correct for this identifier.

| Funder ID and Name | Count |
|--------------------|-------|
| https://doi.org/10.13039/501100008982 (Total) | 11,573 |
| National Science Foundation | 9,974 |
| National Science Foundation of Sri Lanka | 735 |
| NSF | 518 |
| Swiss National Science Foundation | 43 |
| MoSTR \| National Science Foundation | 42 |
| U.S. National Science Foundation | 34 |

Table 3. The most common funder names associated with the funder ID for the National Science Foundation of Sri Lanka in combined Crossref and DataCite award metadata.
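A tally like Table 3 can be approximated from the open Crossref REST API alone. The sketch below assumes the /funders/{id}/works route with cursor-based deep paging and a select filter, both of which the API documents; the helper name is ours.

import requests
from collections import Counter

def crossref_funder_names(funder_id):
    """Count the funder-name strings on works carrying a given funder ID."""
    counts, cursor = Counter(), "*"
    while True:
        resp = requests.get(
            f"https://api.crossref.org/funders/{funder_id}/works",
            params={"rows": 1000, "cursor": cursor, "select": "DOI,funder"},
            timeout=60,
        )
        resp.raise_for_status()
        message = resp.json()["message"]
        for work in message["items"]:
            for funder in work.get("funder", []):
                # Keep only names attached to the funder ID in question.
                if funder.get("DOI", "").endswith(funder_id):
                    counts[funder.get("name", "(no name)")] += 1
        if not message["items"]:
            return counts
        cursor = message["next-cursor"]

for name, n in crossref_funder_names("501100008982").most_common(6):
    print(f"{n:6d}  {name}")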

Award number information is more limited in these metadata, but many of the award numbers are consistent with known patterns for U.S. National Science Foundation award numbers: seven-digit numbers, or NSF abbreviations combined with seven-digit numbers.
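A simple consistency check for these patterns might look like the following sketch, assuming the two forms named above are the only ones of interest.

import re

# Assumed forms: a bare seven-digit number, or an NSF abbreviation
# (optionally with a directorate code) joined to seven digits.
US_NSF_AWARD = re.compile(r"^(?:NSF[A-Z]{0,4}[- ]?)?\d{7}$")

for award in ["1425989", "NSFOCE-1658001", "NE/N018095/1"]:
    print(award, bool(US_NSF_AWARD.match(award)))
# The NERC award (NE/N018095/1) correctly fails to match.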

The picture is very similar for the National Sleep Foundation, although in this case all 882 of the related resources are from Crossref. These include 897 awards with the National Sleep Foundation funder ID (https://doi.org/10.13039/100003187), 206 (23%) of which have the funder name “National Sleep Foundation” and 672 of which have the funder name “National Science Foundation”.

Conclusion

The global research literature is filled with free-text names of researchers, organizations, and funders, which makes it difficult to unambiguously recognize these entities and correctly connect them to their contributions. Persistent identifiers (PIDs) have been developed over the last several decades to help address these problems, and their implementation in many publication systems is underway. Tracking these implementations, understanding problems, and fixing them is critical as these efforts move forward.

Identifiers for funders and awards form the foundation for understanding funder contributions to the global research landscape, so it is important to identify and trace the sources of problems that might occur as required automation is developed and spreads across organizations and processing systems. The U.S. National Science Foundation funds many awards across multiple scientific disciplines and is ubiquitously referred to using the acronym “NSF”. The large number of occurrences of this acronym makes it possible to use it to identify problems that may only occur a small percentage of the time.

Using metadata from Crossref included in CHORUS reports, we searched for identifiers of multiple organizations sharing the acronym “NSF” and identified over 2,000 references to organizations other than the U.S. National Science Foundation within the correctly identified NSF-related records. In many cases, we identified metadata where funder names were given as “National Science Foundation” even though the identifiers were for other organizations. We identified similar funder ambiguities and apparent identifier errors in metadata from DataCite Commons for datasets and text resources.

The overall number of errors is generally small (<2%) relative to the total number of resources, but it points to a problem that may occur more broadly without being recognized, and the impact on particular repositories may be significant. Including funder metadata as part of an article’s metadata is crucial for open-science workflows. As open metadata found in places like CHORUS, Crossref, ROR, and DataCite are used to characterize funder impacts in open research, automated validation aimed at identifying errors like those described here is a critical part of the value chain.

Acknowledgements

This work is part of the INFORMATE Project, a partnership between Metadata Game Changers and CHORUS. This work was funded by the U.S. National Science Foundation (https://ror.org/021nxhr62), award 2334426.

Data Availability

The data retrieved from DataCite Commons for this work are available at https://doi.org/10.5281/zenodo.11116775. CHORUS reports are available for several funding agencies at https://dashboard.chorusaccess.org/.

References

DataCite Commons, Data for National Science Foundation of Sri Lanka, https://commons.datacite.org/ror.org/010xaa060, retrieved May 4, 2024.

DataCite Commons, Data for National Sleep Foundation, https://commons.datacite.org/ror.org/00zc1hf95, retrieved May 4, 2024. 

DataCite Commons, DataCite GraphQL API Guide, https://support.datacite.org/docs/datacite-graphql-api-guide, retrieved May 4, 2024.

De Jonge, H., Kramer, K., Michaud, F., and Hendricks, G., 2023, Open funding metadata through Crossref; a workshop to discuss challenges and improving workflows, https://www.crossref.org/blog/open-funding-metadata-community-workshop-report/. 

Fenner, M., 2021, Pagination with cursor in GraphQL API, https://pidforum.org/t/pagination-with-cursor-in-graphql-api/1572, retrieved May 4, 2024.

French, A., Hendricks, G., Lammey, R, Michaud, F., and Gould, M., 2023, Open Funder Registry to transition into Research Organization Registry (ROR), https://www.crossref.org/blog/open-funder-registry-to-transition-into-research-organization-registry-ror/.

Habermann, T., 2021, Acronyms are Definitely Not Enough, https://doi.org/10.59350/93v82-yr723.

Habermann, T., 2022, Funder Metadata: Identifiers and Award Numbers, https://doi.org/10.59350/xrqzb-re120.

Habermann, T., 2022a, Need help searching for RORs? Try RORRetriever!, https://doi.org/10.59350/4gxfz-4kb47.

Habermann, T., 2023, CHORUS Data Journey, https://doi.org/10.59350/ksgzn-a6w37.

Habermann, T., Jones, J., Packer, T., and Ratner, H., 2023, INFORMATE: Metadata Game Changers and CHORUS Collaborate to Make the Invisible Visible, https://doi.org/10.59350/yqkat-59f79.

Lammey, R., 2024, RORing ahead: using ROR in place of the Open Funder Registry, https://www.crossref.org/blog/roring-ahead-using-ror-in-place-of-the-open-funder-registry/.