Improving Domain Repository Connectivity: Identifying Organizations

Cite this blog as Habermann, T. (2021). Improving Domain Repository Connectivity: Identifying Organizations. Front Matter. https://doi.org/10.59350/8xcgf-drf32

The UNAVCO DataCite Repository has over 5000 records that describe datasets created by researchers from many organizations, all of which are members of the tight-knit and well-established UNAVCO community. In the first blog of this series, I proposed that connecting these organizations to the PID Graph depends on having unique identifiers, i.e., RORs, for these organizations. Most of these organizations have contributed multiple datasets to the community, so they occur multiple times in the metadata. This characteristic of domain communities, i.e., multiple contributions from the same people and organizations, simplifies the process of populating organizational identifiers in the repository because each identifier is used many times.

Table 1 shows the organizations that occur in the UNAVCO metadata along with RORs found for these organizations and the number of times they occur. As expected, a small number of organizations (35) occur many times (2596) in these metadata. The most common organization is UNAVCO itself, which occurs in 1288 records (50%). It is also important to note that RORs were identified for all organizations in the metadata.

Table 1. Organizations from UNAVCO DataCite Metadata and RORs found for them.

| Organization | Identifier | Count |
| --- | --- | --- |
| UNAVCO | https://ror.org/02n9tn974 | 1288 |
| University of Colorado Boulder | https://ror.org/02ttsq026 | 364 |
| United States Geological Survey | https://ror.org/035a68863 | 124 |
| Colorado State University | https://ror.org/03k1gpj17 | 32 |
| University of Montana | https://ror.org/0078xmk34 | 32 |
| New Mexico Institute of Mining and Technology | https://ror.org/005p9kw61 | 32 |
| Pennsylvania State University | https://ror.org/04p491231 | 28 |
| Oregon State University | https://ror.org/00ysfqy60 | 24 |
| University of Oregon | https://ror.org/0293rh119 | 24 |
| San Diego State University | https://ror.org/0264fdx42 | 20 |
| George Washington University | https://ror.org/00y4zzh67 | 16 |
| Idaho State University | https://ror.org/0162z8b04 | 16 |
| Boston University | https://ror.org/05qwgg493 | 16 |
| The Ohio State University | https://ror.org/00rs6vg23 | 16 |
| Office of Polar Programs | https://ror.org/05nwjp114 | 12 |
| Goddard Space Flight Center | https://ror.org/0171mag52 | 12 |
| University of Miami | https://ror.org/02dgjyy92 | 12 |
| The University of Texas at San Antonio | https://ror.org/01kd65564 | 12 |
| Dartmouth College | https://ror.org/049s0rh22 | 12 |
| National Aeronautics and Space Administration | https://ror.org/027ka1x80 | 12 |
| Gustavus Adolphus College | https://ror.org/007q4yk54 | 8 |
| University of California Davis | https://ror.org/05rrcem69 | 8 |
| University of Washington | https://ror.org/00cvxb145 | 8 |
| University of Chicago | https://ror.org/024mw5h28 | 8 |
| Georgia Institute of Technology | https://ror.org/01zkghx44 | 8 |
| Harvard University | https://ror.org/03vek6s52 | 8 |
| National Park Service | https://ror.org/044zqqy65 | 4 |
| Bates College | https://ror.org/003yn7c76 | 4 |
| University of Tennessee at Knoxville | https://ror.org/020f3ap87 | 4 |
| University of Minnesota | https://ror.org/017zqws13 | 4 |
| Woods Hole Oceanographic Institution | https://ror.org/03zbnzt98 | 4 |
| The University of Texas at El Paso | https://ror.org/04d5vba33 | 4 |
| Texas A&M University | https://ror.org/01f5ytq51 | 4 |
| University of Michigan–Ann Arbor | https://ror.org/00jmfr291 | 3 |
| University of Michigan–Ann Arbor | https://ror.org/00jmfr291 | 1 |

These affiliations and RORs were found using two different techniques. In most cases (2184), the affiliations were included in the metadata along with the individual creator. For example:

{
    "name": "Doe, Jane",
    "nameType": "Personal",
    "affiliation": [
        "UNAVCO, Inc."
    ]
}

In this case, the association of the author and the organization is clear. In some cases, however, an author appears on some datasets with affiliations and in others without. Is it possible to spread the known affiliations across the datasets which do not have affiliations in the metadata?
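Creator entries shaped like the example above can be tallied to produce counts of the kind shown in Table 1. The following is only a minimal sketch, not the code used for this analysis: the creator list is hypothetical, the `VARIANTS` mapping from "UNAVCO, Inc." to "UNAVCO" is an assumed normalization step, and the two RORs are copied from Table 1.

```python
from collections import Counter

# Hypothetical creator entries in the shape of the example above.
creators = [
    {"name": "Doe, Jane", "nameType": "Personal", "affiliation": ["UNAVCO, Inc."]},
    {"name": "Roe, Richard", "nameType": "Personal",
     "affiliation": ["University of Colorado Boulder"]},
    {"name": "Doe, Jane", "nameType": "Personal", "affiliation": ["UNAVCO, Inc."]},
    {"name": "Poe, Pat", "nameType": "Personal", "affiliation": []},
]

# Assumed name variants mapped to the organization names used in Table 1.
VARIANTS = {"UNAVCO, Inc.": "UNAVCO"}

# RORs copied from Table 1 for the two organizations used here.
RORS = {
    "UNAVCO": "https://ror.org/02n9tn974",
    "University of Colorado Boulder": "https://ror.org/02ttsq026",
}


def tally_affiliations(creators):
    """Count occurrences of each normalized affiliation name."""
    counts = Counter()
    for creator in creators:
        for affiliation in creator.get("affiliation", []):
            counts[VARIANTS.get(affiliation, affiliation)] += 1
    return counts


counts = tally_affiliations(creators)
for name, count in counts.most_common():
    print(name, RORS.get(name, "no ROR found"), count)
```

Creators without an affiliation, like the last entry here, contribute nothing to the tally; they are the cases the spreading question applies to.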

In the first blog in this series, I described the process of spreading ORCIDs for dataset authors across occurrences of the authors in the metadata that did not originally include ORCIDs. This increased the number of ORCIDs in the metadata significantly and, because the association between a person and their ORCID is one-to-one, there is high confidence in the assertion of the connection between the person and the ORCID. In the affiliation case, the confidence is lower because authors can readily switch organizations.

This situation is illustrated in Figure 1. This author has authored nine datasets and ORCIDs are included in three of them. In this case, the ORCID connectivity for this author is 33%. The connectivity is increased to 100% by adding the ORCID for this author to the six datasets that originally had no ORCIDs, indicated by the grey arrows.
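The nine-dataset example can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the actual implementation: the record layout is hypothetical, authors are matched on an identical name string, and the ORCID is a placeholder value.

```python
# Placeholder ORCID used only for illustration.
PLACEHOLDER_ORCID = "https://orcid.org/0000-0002-1825-0097"


def connectivity(records):
    """Percentage of records that carry an ORCID."""
    with_id = sum(1 for r in records if r["orcid"])
    return 100 * with_id / len(records)


def spread_orcids(records):
    """Copy a known ORCID to records of the same author that lack one."""
    # Assumption: an identical name string identifies the same person.
    known = {r["name"]: r["orcid"] for r in records if r["orcid"]}
    return [dict(r, orcid=r["orcid"] or known.get(r["name"])) for r in records]


# Nine datasets by one author; only the first three include the ORCID.
records = [
    {"dataset": f"D{i}", "name": "Doe, Jane",
     "orcid": PLACEHOLDER_ORCID if i <= 3 else None}
    for i in range(1, 10)
]

before = connectivity(records)
after = connectivity(spread_orcids(records))
print(f"{before:.0f}% -> {after:.0f}%")  # prints "33% -> 100%"
```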

Figure 1. Spreading identifiers through metadata collections.


The same datasets are shown again on the right side of the figure along with affiliations. In this case, affiliation A1 occurs three times, A2 occurs twice, and four datasets have no affiliation. Can either affiliation be added to those four datasets? Of course, no answer here has 100% confidence. The rules used here to spread affiliations, as shown by the grey arrows, are:

1. If only one affiliation exists, use it.

2. If more than one affiliation exists, use the most common one.

3. If two affiliations exist and occur an equal number of times, use both.

In addition, affiliations identified this way are flagged for evaluation by the author, or by a community member that is familiar with their affiliation history. So, if the spreading introduces errors, they can be corrected.

Fortunately, this ambiguity only occurred in four out of fifty-four cases. In the others the authors only had one associated affiliation.
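The three spreading rules and the review flag can be sketched as follows. This is a hedged illustration, not the code used for the analysis: the record layout and the `inferred` flag name are assumptions, authors are matched on name alone, and ties among more than two affiliations are handled by keeping all tied leaders, which goes slightly beyond rule 3 as stated.

```python
from collections import Counter


def choose_affiliations(known):
    """Apply the three rules to the list of an author's known affiliations."""
    counts = Counter(known)
    if not counts:
        return []  # nothing to spread
    ranked = counts.most_common()
    top = ranked[0][1]
    # Rules 1 and 2 yield a single winner; rule 3 keeps every tied leader.
    return [name for name, n in ranked if n == top]


def spread_affiliations(records):
    """Fill empty affiliation lists and flag the filled records for review."""
    known_by_author = {}
    for r in records:
        known_by_author.setdefault(r["name"], []).extend(r["affiliation"])
    result = []
    for r in records:
        if r["affiliation"]:
            result.append(dict(r, inferred=False))
        else:
            chosen = choose_affiliations(known_by_author[r["name"]])
            result.append(dict(r, affiliation=chosen, inferred=bool(chosen)))
    return result


# Figure 1 example: A1 three times, A2 twice, four datasets without affiliation.
records = (
    [{"name": "Doe, Jane", "affiliation": ["A1"]} for _ in range(3)]
    + [{"name": "Doe, Jane", "affiliation": ["A2"]} for _ in range(2)]
    + [{"name": "Doe, Jane", "affiliation": []} for _ in range(4)]
)
filled = spread_affiliations(records)
print(sum(r["inferred"] for r in filled), filled[-1]["affiliation"])
```

For the Figure 1 example, rule 2 applies, so the four empty records receive A1 and are flagged. Records carrying `inferred=True` are the ones an author or community member would be asked to check.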

Organizational Connectivity

As mentioned in the last blog, connectivity is the percentage of items in a collection that have identifiers (ORCIDs or RORs in this case). The UNAVCO metadata do not include RORs, so affiliations were used as a proxy for the RORs, and connectivity was calculated for ORCIDs and affiliations. Now that RORs have been identified, we can calculate the organizational connectivity directly. Figure 2 shows the ROR connectivity after RORs were identified. The results show that 6% of the DOIs have RORs for all authors (complete connectivity), 8% have RORs for some authors (partial connectivity), and 85% have no RORs (missing connectivity). This is a significant improvement over the initial state, in which no DOIs had RORs (100% missing).
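The complete, partial, and missing classes can be computed per DOI along these lines. This is a minimal sketch under stated assumptions: the DOIs are made up, and the `ror` key on each creator is an assumed field name rather than the DataCite schema property.

```python
from collections import Counter


def classify(creators):
    """Label one DOI by how many of its creators carry an identifier."""
    flags = [bool(c.get("ror")) for c in creators]
    if flags and all(flags):
        return "complete"
    if any(flags):
        return "partial"
    return "missing"


def connectivity_summary(dois):
    """Percentage of DOIs in each connectivity class."""
    counts = Counter(classify(creators) for creators in dois.values())
    return {k: 100 * counts[k] / len(dois)
            for k in ("complete", "partial", "missing")}


# Hypothetical DOIs; the `ror` key is an assumed field name.
dois = {
    "10.0000/a": [{"name": "Doe, Jane", "ror": "https://ror.org/02n9tn974"}],
    "10.0000/b": [
        {"name": "Doe, Jane", "ror": "https://ror.org/02n9tn974"},
        {"name": "Roe, Richard", "ror": None},
    ],
    "10.0000/c": [{"name": "Roe, Richard", "ror": None}],
    "10.0000/d": [{"name": "Poe, Pat", "ror": None}],
}
summary = connectivity_summary(dois)
print(summary)  # {'complete': 25.0, 'partial': 25.0, 'missing': 50.0}
```

The same function works for ORCIDs if the identifier key is swapped, since the three classes are defined the same way for people and organizations.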

Figure 2. Bar showing the portion of UNAVCO DataCite metadata records with organizations with known RORs.


Conclusions

Improving the number of identifiers of all kinds (connectivity) across repositories is important as the role of connections increases in data discovery, understanding, and re-use. Domain repositories can benefit from well-developed communities in making these improvements because community members, both individuals and organizations, make multiple contributions through time.

These benefits were demonstrated using the UNAVCO DataCite repository as an example. In this case, the three most common ORCIDs make up 50% of all ORCIDs in the repository. Similarly, as shown here, the three most common RORs make up almost 70% of the RORs in the repository.

Taking advantage of this characteristic and the spreading approach described above, the connectivity of the UNAVCO repository was increased using information already in the repository. The portion of DOIs with ORCIDs for all authors (complete connectivity) increased by a factor of three, from 2% to 6%, while the portion of DOIs with RORs for all authors increased from zero to 6% (see Table 2).

Table 2. Increased connectivity for people and organizations in the UNAVCO DataCite repository.

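| | People: missing | People: partial | People: complete | Organizations: missing | Organizations: partial | Organizations: complete |
| --- | --- | --- | --- | --- | --- | --- |
| Initial | 93% | 4% | 2% | 100% | 0% | 0% |
| Improved | 86% | 8% | 6% | 85% | 8% | 6% |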

Unfortunately, as in many other repositories, the portion of DOIs without ORCIDs or RORs remains very high. Fortunately, like many domain repositories, UNAVCO maintains a list of papers that have been written by community members using data from UNAVCO. Metadata for these papers is another source of information that can be brought to bear on the problem of increasing connectivity.