How Many RORs Do We Need?

Cite this blog as Habermann, T. (2019). How Many RORs Do We Need? Front Matter. https://doi.org/10.59350/4tbaw-m9382

The scholarly communications and open science communities are getting excited about the benefits of metadata that includes persistent identifiers for organizations. Now that an open registry of organizational identifiers exists, the community is facing two adoption challenges: 1) evolving metadata dialects to include identifiers and 2) implementing them in existing and future metadata. Several dialects already include this capability and others are implementing it and building these identifiers into searches. In other words, great progress on Step 1. How can we move forward on Step 2?

I took a look at a sample of DataCite metadata and discovered that many DataCite repositories manage metadata from a small number of organizations and, therefore, only need to know a small number of organizational identifiers (RORs are the focus here) to populate their entire collections. These repositories are supported by DataCite members. An obvious next question is: Are there DataCite members that only need to know a small number of RORs? 

My sample of over 95,000 DataCite records shows that twenty-three DataCite members have five or fewer affiliations in their metadata included in the sample. The answer appears to be: yes, some DataCite members only need to know a small number of RORs. More good news.
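A minimal sketch of this kind of count, assuming the sample has already been reduced to (member id, affiliation string) pairs. The `cmwf` and `cyark` rows come from Table 1; the `bigrepo` member, the function names, and the default limit of five are illustrative assumptions, not the actual analysis code.

```python
from collections import defaultdict

# Illustrative (member id, affiliation) pairs. The cmwf and cyark rows are from
# Table 1; "bigrepo" is a made-up member that exceeds the threshold.
records = [
    ("cmwf", "Dartmouth College"),
    ("cmwf", "University of Oxford"),
    ("cmwf", "Commonwealth Fund"),
    ("cmwf", "Commonwealth Fund"),  # a repeated string counts once
    ("cyark", "CyArk"),
] + [("bigrepo", f"Hypothetical Organization {i}") for i in range(6)]


def affiliations_by_member(pairs):
    """Collect the set of unique affiliation strings seen for each member."""
    grouped = defaultdict(set)
    for member, affiliation in pairs:
        grouped[member].add(affiliation)
    return dict(grouped)


def members_with_few_affiliations(pairs, limit=5):
    """Keep only the members whose unique affiliation count is at most `limit`."""
    grouped = affiliations_by_member(pairs)
    return {m: sorted(a) for m, a in grouped.items() if len(a) <= limit}


small = members_with_few_affiliations(records)
```

Here `small` keeps `cmwf` and `cyark` and drops the hypothetical `bigrepo`, which has six distinct strings.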

The Affiliations

The DataCite members with five or fewer affiliations are shown in Table 1. Combined, they have forty-nine unique affiliation strings in their existing metadata. Even this small sample demonstrates the kinds of challenges associated with free-text affiliation strings. There are small punctuation differences (“American University, Center for Social Media” vs. “American University Center for Social Media”), acronyms (BGI) and abbreviations (Penn vs. Pennsylvania, Caltech vs. California Institute of Technology), content left over from templates (Mysteeriorganisaatio), and multiple names for the same organization (“Technological University Dublin” vs. “Technological University for Dublin”). These and other similar challenges will keep us busy until persistent identifiers for organizations are universal!
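To make the distinction between these problem types concrete, here is a small sketch of string normalization. The `normalize` function is a hypothetical helper, not part of any of the strategies discussed in this post. It collapses case, punctuation, and whitespace differences, but it cannot reconcile abbreviations or alternative names.

```python
import re
import unicodedata


def normalize(affiliation: str) -> str:
    """Lower-case the string, replace punctuation with spaces, squeeze whitespace."""
    text = unicodedata.normalize("NFKC", affiliation).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # commas, hyphens, etc. become spaces
    return " ".join(text.split())


# Punctuation-only variants collapse to the same key ...
same = normalize("American University, Center for Social Media") == normalize(
    "American University Center for Social Media"
)

# ... but abbreviations and alternative names do not.
different = normalize("Caltech") != normalize("California Institute of Technology")
```

Normalization of this kind only helps with the first group of differences; acronyms, template leftovers, and renamed organizations still need a lookup table, a search service, or a person.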

Table 1. DataCite members with five or fewer affiliations in the sample metadata and those affiliations.

Member | Affiliation
Aridhia Informatics Ltd. (id = aridhia) | EPAD
 | Aridhia Informatics
American University Library (id = au) | Center for Media and Social Impact
 | American University, Center for Social Media
 | American University Center for Social Media
 | Center for Social Media
 | American University
The Commonwealth Fund (id = cmwf) | Dartmouth College
 | University of Oxford
 | Commonwealth Fund
China National GeneBank (id = cngb) | BGI
Network for Computational Modeling in the Social and Ecological Sciences (CoMSES Net) (id = comses) | Arizona State University
 | National Research Institute of Science and Technology for Environment and Agriculture
 | Unité Mixte Internationnale de Modélisation Mathématique et Informatiques des Systèmes Complèxes
CSC IT Center for Science (id = csc) | Aalto University
 | Mysteeriorganisaatio
University of Colorado Boulder Libraries (id = cub) | University of Colorado Boulder
CyArk (id = cyark) | CyArk
Cyberleninka (id = cyberl) | International Society for the Comparative Study of Civilizations (ISCSC), Board of Directors Member; Pro-Rector, International University for the Societal Development; Washington, the USA
Publications Office of the European Union (id = europ) | mEDRA
Global Biodiversity Information Facility (id = gbif) | The University Museum, The University of Tokyo
Globus (id = globus) | University of Illinois at Urbana Champaign
 | Lawrence Berkeley National Laboratory
University of Illinois Urbana-Champaign Library (id = illinois) | University of Illinois at Urbana-Champaign
Indiana University (id = iu) | Indiana University School of Medicine
 | Indiana University Bloomington
National Institute of Standards and Technology (id = nist) | National Institute of Standards and Technology
OCLC, Inc. (id = oclc) | OCLC Research
Pennsylvania State University (id = psu) | Penn State University
 | Pennsylvania State University
 | University of Illinois at Urbana Champaign
 | University of Adelaide
 | Smithsonian Institution
Research Workspace (id = rewo) | Audubon Alaska
TIND (id = tind) | California Institute of Technology
 | Caltech
 | Caltech Library
 | The Aerospace Corporation
University College Dublin (id = ucd) | Technological University Dublin
 | Technological University for Dublin
 | University of Gothenburg
 | National University of Ireland, Galway
 | United Language Group (United States)
University of Miami Libraries (id = uml) | University of Miami
UNAVCO (id = unavco) | UNAVCO, Inc.
 | Pennsylvania State University
 | University of Colorado - Boulder
 | United States Geological Survey
 | University of Washington - Seattle
University of Utah (id = uutah) | Journal of Early Hearing Detection and Intervention
 | Texas A&M University

Finding the RORs

This small sample is a microcosm of the broader problem of finding identifiers for organizations so that they can be added to existing metadata. The problem has been addressed in many contexts, and experiences from the American Physical Society, NASA, Dryad, and many others all indicate that it is a tricky business. The challenges identified in this small sample become much harder at scales where automation is required.

I have been experimenting with three strategies: two that I developed with Dryad, and the affiliation search recently introduced by ROR. These are referred to here as the Dryad, String, and Affiliation strategies. Results of searching for RORs with all three strategies are discussed below.
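For readers who have not tried the ROR affiliation search, the sketch below shows one way it might be called from Python. It assumes the public endpoint `https://api.ror.org/organizations` with an `affiliation` query parameter, and a JSON response whose `items` each carry a `chosen` flag and an `organization` record with an `id`. Check the current ROR API documentation before relying on those details. The function names are my own, and the sample response in the usage note is hand-built for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for the ROR API; confirm against the current documentation.
ROR_API = "https://api.ror.org/organizations"


def affiliation_query_url(affiliation):
    """Build the query URL for one free-text affiliation string."""
    return f"{ROR_API}?{urlencode({'affiliation': affiliation})}"


def chosen_ror(response):
    """Return the ROR id of the item flagged as chosen, or None if there is none."""
    for item in response.get("items", []):
        if item.get("chosen"):
            return item["organization"]["id"]
    return None


def search_affiliation(affiliation):
    """Call the service (needs network access) and return the chosen ROR, if any."""
    with urlopen(affiliation_query_url(affiliation), timeout=30) as reply:
        return chosen_ror(json.load(reply))
```

With a hand-built response such as `{"items": [{"chosen": True, "organization": {"id": "https://ror.org/05dxps055"}}]}`, `chosen_ror` returns `https://ror.org/05dxps055`. When no item is flagged as chosen it returns `None`.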

Affiliations with One ROR

The most common result, twenty-three of forty-nine cases, was for the same ROR to be identified by all three strategies. These are indicated with DSA in the Strat column of Table 2 and are mostly affiliation strings that 1) match the ROR name for the organization, 2) identify only one organization, and 3) do not include much extraneous information. All good affiliation name practices.

The next most common result is one ROR identified by the String and Affiliation strategies (SA in Table 2). These are also clean affiliation strings with names that match ROR names or aliases (Caltech). The Dryad strategy did not identify these affiliations because it uses an affiliation lookup table built from past experience. When that table is updated with the SA affiliations and RORs, these will become DSA matches.
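The lookup-table update described above could look something like the following sketch. It assumes the table is a plain mapping from exact affiliation strings to RORs, which is a simplification and not necessarily how the Dryad table is stored. The function names are hypothetical, and the RORs are taken from Tables 2 and 3.

```python
# Simplified stand-in for the Dryad lookup table: exact string -> ROR.
lookup = {"California Institute of Technology": "https://ror.org/05dxps055"}

# Results from the other two strategies (values from Tables 2 and 3).
string_hits = {
    "Caltech": "https://ror.org/05dxps055",
    "Commonwealth Fund": "https://ror.org/049kzbj92",
    "BGI": "https://ror.org/045pn2j94",
}
affiliation_hits = {
    "Caltech": "https://ror.org/05dxps055",
    "Commonwealth Fund": "https://ror.org/049kzbj92",
    "BGI": "https://ror.org/02act3e13",  # disagrees with the String result
}


def dryad_lookup(affiliation, table):
    """Exact-match lookup; returns None when the string is not in the table."""
    return table.get(affiliation)


def add_agreed_matches(table, s_hits, a_hits):
    """Add strings that String and Affiliation resolve to the same ROR."""
    added = []
    for affiliation, ror in s_hits.items():
        if affiliation not in table and a_hits.get(affiliation) == ror:
            table[affiliation] = ror
            added.append(affiliation)
    return added


added = add_agreed_matches(lookup, string_hits, affiliation_hits)
```

After the update, “Caltech” and “Commonwealth Fund” resolve through the table, while “BGI” is left out because the two strategies disagree on it.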

The final group of RORs is those identified by just one strategy (D or A). These reflect differences in the details of the strategies that inevitably show up when searching with multiple strategies. It could be the curation involved in the Dryad lookup table, or details of the matching and ranking algorithms in the ROR affiliation search. In any case, these provide interesting learning opportunities. The only suspicious pick is for the affiliation “Journal of Early Hearing Detection and Intervention”, which is the name of a journal rather than an organization and was matched to Inerventions (Sweden).
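The strategy codes used in the tables (DSA, SA, D, A, and so on) can be derived by grouping the strategies by the ROR each one returned. The sketch below is one way to do that. The function names and the input shape (a mapping from strategy letter to ROR or `None`) are assumptions made for illustration.

```python
def strategy_codes(hits):
    """Group strategy letters by the ROR they returned, e.g. {ror: "DSA"}."""
    by_ror = {}
    for letter in "DSA":  # fixed order so codes read D, then S, then A
        ror = hits.get(letter)
        if ror:
            by_ror[ror] = by_ror.get(ror, "") + letter
    return by_ror


def classify(hits):
    """Label an affiliation by how many distinct RORs the strategies found."""
    count = len(strategy_codes(hits))
    if count == 0:
        return "no ROR"
    if count == 1:
        return "one ROR"
    return "multiple RORs"


caltech = {"D": None, "S": "https://ror.org/05dxps055", "A": "https://ror.org/05dxps055"}
galway = {
    "D": "https://ror.org/00shsf120",
    "S": "https://ror.org/03bea9k73",
    "A": "https://ror.org/00shsf120",
}
cyark = {"D": None, "S": None, "A": None}
```

With these inputs, “Caltech” gets the code SA for a single ROR, “National University of Ireland, Galway” splits into DA and S across two RORs, and “CyArk” has no ROR at all.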

Table 2. Affiliations with one ROR.

Affiliation | Strat* | ROR | Organization Name | Country
Aalto University | DSA | https://ror.org/020hwjq30 | Aalto University | Finland
American University | DSA | https://ror.org/052w4zt36 | American University | United States
American University, Center for Social Media | DSA | https://ror.org/052w4zt36 | American University | United States
Arizona State University | DSA | https://ror.org/03efmqc40 | Arizona State University | United States
California Institute of Technology | DSA | https://ror.org/05dxps055 | California Institute of Technology | United States
Dartmouth College | DSA | https://ror.org/049s0rh22 | Dartmouth College | United States
Indiana University Bloomington | DSA | https://ror.org/02k40bc56 | Indiana University Bloomington | United States
Lawrence Berkeley National Laboratory | DSA | https://ror.org/02jbv0t02 | Lawrence Berkeley National Laboratory | United States
National Institute of Standards and Technology | DSA | https://ror.org/05xpvk416 | National Institute of Standards and Technology | United States
National Research Institute of Science and Technology for Environment and Agriculture | DSA | https://ror.org/01wep6g48 | National Research Institute of Science and Technology for Environment and Agriculture | France
Pennsylvania State University | DSA | https://ror.org/04p491231 | Pennsylvania State University | United States
Smithsonian Institution | DSA | https://ror.org/01pp8nd67 | Smithsonian Institution | United States
Texas A&M University | DSA | https://ror.org/01f5ytq51 | Texas A&M University | United States
The University Museum, The University of Tokyo | DSA | https://ror.org/057zh3y96 | University of Tokyo | Japan
UNAVCO, Inc. | DSA | https://ror.org/02n9tn974 | UNAVCO | United States
United States Geological Survey | DSA | https://ror.org/035a68863 | United States Geological Survey | United States
University of Adelaide | DSA | https://ror.org/00892tw58 | University of Adelaide | Australia
University of Colorado Boulder | DSA | https://ror.org/02ttsq026 | University of Colorado Boulder | United States
University of Gothenburg | DSA | https://ror.org/01tm6cn81 | University of Gothenburg | Sweden
University of Illinois at Urbana Champaign | DSA | https://ror.org/047426m28 | University of Illinois at Urbana Champaign | United States
University of Miami | DSA | https://ror.org/02dgjyy92 | University of Miami | United States
University of Oxford | DSA | https://ror.org/052gg0110 | University of Oxford | United Kingdom
University of Washington - Seattle | DSA | https://ror.org/00cvxb145 | University of Washington | United States
University of Illinois at Urbana-Champaign | DA | https://ror.org/047426m28 | University of Illinois at Urbana Champaign | United States
Caltech | SA | https://ror.org/05dxps055 | California Institute of Technology | United States
Commonwealth Fund | SA | https://ror.org/049kzbj92 | Commonwealth Fund | United States
Technological University Dublin | SA | https://ror.org/04t0qbt32 | Technological University Dublin | Ireland
The Aerospace Corporation | SA | https://ror.org/01ar9e455 | The Aerospace Corporation | United States
United Language Group (United States) | SA | https://ror.org/02zf60h18 | United Language Group (United States) | United States
Unité Mixte Internationnale de Modélisation Mathématique et Informatiques des Systèmes Complèxes | SA | https://ror.org/053kxkj53 | Unité Mixte Internationnale de Modélisation Mathématique et Informatiques des Systèmes Complèxes | France
American University Center for Social Media | D | https://ror.org/052w4zt36 | American University | United States
Journal of Early Hearing Detection and Intervention | A | https://ror.org/057xcq908 | Inerventions (Sweden) | Sweden
OCLC Research | A | https://ror.org/02nv42w72 | Online Computer Library Center | United States
Technological University for Dublin | A | https://ror.org/04t0qbt32 | Technological University Dublin | Ireland
University of Colorado - Boulder | A | https://ror.org/02ttsq026 | University of Colorado Boulder | United States
* Strategy (D=Dryad, S=String, A=Affiliation)

Affiliations with two RORs

Two RORs were found for three affiliations that are listed in Table 3. Cases like these typically need to be solved with additional information from the complete metadata, the metadata creators, or other sources. The affiliations here illustrate some common factors that can cause uncertainty: acronyms, organizations within organizations, or multiple campuses.

Table 3. Affiliations with two RORs.

Affiliation | Strat* | ROR | Organization Name | Country
BGI | S | https://ror.org/045pn2j94 | Beijing Genomics Institute | China
 | A | https://ror.org/02act3e13 | BGI (United States) | United States
Indiana University School of Medicine | D | https://ror.org/02k40bc56 | Indiana University Bloomington | United States
 | A | https://ror.org/05ht4p406 | Indiana University School of Medicine - Lafayette | United States
National University of Ireland, Galway | DA | https://ror.org/00shsf120 | National University of Ireland | Ireland
 | S | https://ror.org/03bea9k73 | National University of Ireland, Galway | Ireland
* Strategy (D=Dryad, S=String, A=Affiliation)

Affiliations with No RORs

Finally, there are eleven affiliations that could not be resolved to RORs (Table 4). Some of these are organizations that do not yet have RORs, and some are affiliation strings that cannot currently be resolved without context or human intervention.

An interesting example of how context and interaction with metadata creators can help is provided by two affiliations in Table 4: “Center for Media and Social Impact” and “Center for Social Media”. Table 1 shows that these are two of the five affiliation strings associated with the American University Library. A Google search for “American University Center for Social Media” points to the Center for Media & Social Impact at American University. It therefore appears that all five of the American University Library affiliations shown in Table 1 are actually the same organization, but none of them has the current name from the organization website. The correct ROR in this case (American University, https://ror.org/052w4zt36) was identified for three of the five affiliations (Table 2).

Table 4. Affiliations with no RORs.

Organizations with No RORs
Aridhia Informatics
Audubon Alaska
Caltech Library
Center for Media and Social Impact
Center for Social Media
CyArk
EPAD
International Society for the Comparative Study of Civilizations (ISCSC), Board of Directors Member; Pro-Rector, International University for the Societal Development; Washington, the USA
mEDRA
Mysteeriorganisaatio
Penn State University

Conclusions

The initial question posed here was: are there DataCite members that only need to know a few RORs to populate their entire DataCite metadata collection with organizational identifiers? I took a look at twenty-three DataCite members with five or fewer affiliations in their metadata. Thirteen of the twenty-three members (57%) were a complete success: RORs were identified for all of the affiliation strings in the sample metadata. Another five were above 50% and, finally, five had no RORs identified.
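As a worked illustration of how such per-member figures can be computed, the sketch below scores three members using outcomes read from Tables 1, 2, and 4. The function names and the three-way grouping labels are my own; this is not the code behind the numbers above.

```python
# One flag per affiliation string: True if a ROR was identified.
outcomes = {
    "unavco": [True, True, True, True, True],
    "tind": [True, True, False, True],  # "Caltech Library" was not resolved
    "cyark": [False],
}


def resolved_fraction(flags):
    """Share of a member's affiliation strings that resolved to a ROR."""
    return sum(flags) / len(flags)


def outcome_group(flags):
    """Label a member as complete, partial, or none."""
    fraction = resolved_fraction(flags)
    if fraction == 1:
        return "complete"
    if fraction == 0:
        return "none"
    return "partial"


summary = {m: (resolved_fraction(f), outcome_group(f)) for m, f in outcomes.items()}
```

In this subset UNAVCO is a complete success, TIND resolves three of its four strings (75%), and CyArk has no ROR yet.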

Even this small sample of affiliations (49) demonstrated some common challenges along the path to ROR adoption. First, we found a case where a single organization was represented five different ways in one metadata collection. This problem is pervasive in systems built on free-text entry of any information into any data collection. It is one of the central problems that the introduction of identifiers of any kind is trying to address.

Second, we found that affiliations written as unaccompanied acronyms are difficult to resolve unambiguously. It is well known that acronyms need to be spelled out in scientific papers, and the same is true when entering affiliations into metadata. Remember that you are writing the affiliation so that the world knows where you work: avoid acronyms, colloquialisms, and jargon!

Finally, several strategies being used to find identifiers from affiliation strings agreed in most of the simple cases. At the same time, for more complex cases, each strategy does well in some and poorly in others. There are always cases where humans, internet searches, and interactions with metadata creators and managers will be needed to get the right answer.