Improving Domain Repository Connectivity: Identifying Organizations
Cite this blog as Habermann, T. (2021). Improving Domain Repository Connectivity: Identifying Organizations. Front Matter. https://doi.org/10.59350/8xcgf-drf32
The UNAVCO DataCite Repository has over 5000 records that describe datasets created by researchers from many organizations, all of which are members of the tight-knit and well-established UNAVCO community. In the first blog of this series, I proposed that connecting these organizations to the PID Graph depends on having unique identifiers, i.e., RORs, for these organizations. Most of these organizations have contributed multiple datasets to the community, so they occur multiple times in the metadata. This characteristic of domain communities, i.e., multiple contributions from the same people and organizations, simplifies the process of populating organizational identifiers in the repository because each identifier is used many times.
Table 1 shows the organizations that occur in the UNAVCO metadata along with RORs found for these organizations and the number of times they occur. As expected, a small number of organizations (35) occur many times (2596) in these metadata. The most common organization is UNAVCO itself, which occurs in 1288 records (50%). It is also important to note that RORs were identified for all organizations in the metadata.
Table 1. Organizations from UNAVCO DataCite Metadata and RORs found for them.
Organization | Identifier | Count |
UNAVCO | https://ror.org/02n9tn974 | 1288 |
University of Colorado Boulder | https://ror.org/02ttsq026 | 364 |
United States Geological Survey | https://ror.org/035a68863 | 124 |
Colorado State University | https://ror.org/03k1gpj17 | 32 |
University of Montana | https://ror.org/0078xmk34 | 32 |
New Mexico Institute of Mining and Technology | https://ror.org/005p9kw61 | 32 |
Pennsylvania State University | https://ror.org/04p491231 | 28 |
Oregon State University | https://ror.org/00ysfqy60 | 24 |
University of Oregon | https://ror.org/0293rh119 | 24 |
San Diego State University | https://ror.org/0264fdx42 | 20 |
George Washington University | https://ror.org/00y4zzh67 | 16 |
Idaho State University | https://ror.org/0162z8b04 | 16 |
Boston University | https://ror.org/05qwgg493 | 16 |
The Ohio State University | https://ror.org/00rs6vg23 | 16 |
Office of Polar Programs | https://ror.org/05nwjp114 | 12 |
Goddard Space Flight Center | https://ror.org/0171mag52 | 12 |
University of Miami | https://ror.org/02dgjyy92 | 12 |
The University of Texas at San Antonio | https://ror.org/01kd65564 | 12 |
Dartmouth College | https://ror.org/049s0rh22 | 12 |
National Aeronautics and Space Administration | https://ror.org/027ka1x80 | 12 |
Gustavus Adolphus College | https://ror.org/007q4yk54 | 8 |
University of California Davis | https://ror.org/05rrcem69 | 8 |
University of Washington | https://ror.org/00cvxb145 | 8 |
University of Chicago | https://ror.org/024mw5h28 | 8 |
Georgia Institute of Technology | https://ror.org/01zkghx44 | 8 |
Harvard University | https://ror.org/03vek6s52 | 8 |
National Park Service | https://ror.org/044zqqy65 | 4 |
Bates College | https://ror.org/003yn7c76 | 4 |
University of Tennessee at Knoxville | https://ror.org/020f3ap87 | 4 |
University of Minnesota | https://ror.org/017zqws13 | 4 |
Woods Hole Oceanographic Institution | https://ror.org/03zbnzt98 | 4 |
The University of Texas at El Paso | https://ror.org/04d5vba33 | 4 |
Texas A&M University | https://ror.org/01f5ytq51 | 4 |
University of Michigan–Ann Arbor | https://ror.org/00jmfr291 | 3 |
University of Michigan–Ann Arbor | https://ror.org/00jmfr291 | 1 |
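A tally like the one in Table 1 could be produced along the following lines. This is only a minimal sketch, not the code used for this analysis: the function name `tally_affiliations`, the `ROR_BY_NAME` lookup (seeded with three rows from Table 1), and the sample creator records are all hypothetical, and it assumes affiliation strings have already been normalized to the organization names used in the table.

```python
from collections import Counter

# Hypothetical lookup seeded with three rows from Table 1.
ROR_BY_NAME = {
    "UNAVCO": "https://ror.org/02n9tn974",
    "University of Colorado Boulder": "https://ror.org/02ttsq026",
    "United States Geological Survey": "https://ror.org/035a68863",
}

def tally_affiliations(creators):
    """Count affiliation names across creator records and attach RORs.

    Assumes each creator is a dict with an optional "affiliation" list of
    already-normalized organization names. Names without a known ROR get None.
    """
    counts = Counter(
        name for creator in creators for name in creator.get("affiliation", [])
    )
    return [(name, ROR_BY_NAME.get(name), n) for name, n in counts.most_common()]

# Made-up creator records for illustration only.
creators = [
    {"name": "Doe, Jane", "affiliation": ["UNAVCO"]},
    {"name": "Roe, Richard", "affiliation": ["UNAVCO"]},
    {"name": "Poe, Pat", "affiliation": ["University of Colorado Boulder"]},
    {"name": "Moe, Max"},  # no affiliation in the metadata
]
rows = tally_affiliations(creators)
for name, ror, count in rows:
    print(f"{name} | {ror} | {count}")
```

In practice the hard part is the normalization step that this sketch skips, since the same organization can appear under several spellings in the metadata.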
These affiliations and RORs were found using two different techniques. In most cases (2184), the affiliations were included in the metadata along with the individual creator. For example:
{ "name": "Doe, Jane", "nameType": "Personal", "affiliation": [ "UNAVCO, Inc." ] }
In this case, the association of the author and the organization is clear. In some cases, however, an author appears on some datasets with affiliations and on others without. Is it possible to spread the known affiliations across the datasets that do not have affiliations in the metadata?
In the first blog in this series, I described the process of spreading ORCIDs for dataset authors across occurrences of the authors in the metadata that did not originally include ORCIDs. This increased the number of ORCIDs in the metadata significantly and, because the association between a person and their ORCID is one-to-one, there is high confidence in the assertion of the connection between the person and the ORCID. In the affiliation case, the confidence is lower, because authors can readily switch organizations.
This situation is illustrated in Figure 1. This author has authored nine datasets and ORCIDs are included in three of them. In this case, the ORCID connectivity for this author is 33%. The connectivity is increased to 100% by adding the ORCID for this author to the six datasets that originally had no ORCIDs, indicated by the grey arrows.
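The ORCID case can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used for the analysis: the record layout, the function names `spread_orcids` and `connectivity`, and the placeholder ORCID are all hypothetical, and authors are matched by exact name string, which a real workflow would need to treat more carefully.

```python
def spread_orcids(records):
    """Copy an author's known ORCID onto that author's records that lack one.

    Assumes each record is a dict with a "name" and an optional "orcid", and
    that identical name strings refer to the same person.
    """
    known = {r["name"]: r["orcid"] for r in records if r.get("orcid")}
    return [{**r, "orcid": r.get("orcid") or known.get(r["name"])} for r in records]

def connectivity(records):
    """Percentage of records that carry an ORCID."""
    return 100 * sum(1 for r in records if r.get("orcid")) / len(records)

# Nine datasets by one author, three of them with an ORCID (as in Figure 1).
placeholder = "https://orcid.org/0000-0000-0000-0000"  # not a real ORCID
records = [
    {"name": "Doe, Jane", "orcid": placeholder if i < 3 else None} for i in range(9)
]
before = round(connectivity(records))
after = round(connectivity(spread_orcids(records)))
print(before, after)  # 33 100
```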
The same datasets are shown again on the right side of the Figure along with affiliations. In this case, affiliation A1 occurs three times, A2 occurs twice, and four datasets have no affiliations. Can either affiliation be added to the other four datasets? Of course, no answer here has 100% confidence. The rules used here to spread affiliations, as shown by the grey arrows, are:
1. if only one affiliation exists, use it
2. if more than one affiliation exists, use the most common one
3. if two affiliations exist and occur an equal number of times, use both.
In addition, affiliations identified this way are flagged for evaluation by the author, or by a community member that is familiar with their affiliation history. So, if the spreading introduces errors, they can be corrected.
Fortunately, this ambiguity only occurred in four out of fifty-four cases. In the others the authors only had one associated affiliation.
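The three spreading rules, together with the review flag, could be expressed roughly as follows. This is a sketch under assumptions, not the implementation used here: the names `choose_affiliations`, `spread_affiliations`, and `needs_review` are hypothetical, each dataset is assumed to carry a list of affiliation strings for one author, and rule 3 is generalized to keep every affiliation tied for most common.

```python
from collections import Counter

def choose_affiliations(known):
    """Apply the spreading rules to one author's known affiliations.

    A single affiliation is returned as is (rule 1), the most common one wins
    when there are several (rule 2), and ties are all kept (rule 3,
    generalized here to any number of tied affiliations).
    """
    counts = Counter(known)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(name for name, n in counts.items() if n == top)

def spread_affiliations(datasets):
    """Fill in missing affiliations and flag the inferred ones for review."""
    known = [a for d in datasets for a in d["affiliations"]]
    chosen = choose_affiliations(known)
    result = []
    for d in datasets:
        if d["affiliations"] or not chosen:
            result.append({**d, "needs_review": False})
        else:
            result.append({**d, "affiliations": list(chosen), "needs_review": True})
    return result

# The Figure 1 example: A1 three times, A2 twice, four datasets without.
datasets = (
    [{"affiliations": ["A1"]} for _ in range(3)]
    + [{"affiliations": ["A2"]} for _ in range(2)]
    + [{"affiliations": []} for _ in range(4)]
)
result = spread_affiliations(datasets)
flagged = [d for d in result if d["needs_review"]]
print(len(flagged), flagged[0]["affiliations"])  # 4 ['A1']
```

Keeping the `needs_review` flag on every inferred affiliation is what makes the later correction by the author or a community member possible.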
Organizational Connectivity
As mentioned in the last blog, connectivity is the percentage of items in a collection that have identifiers (ORCIDs or RORs in this case). The UNAVCO metadata do not include RORs, so affiliations were used as a proxy for the RORs and connectivity was calculated for ORCIDs and affiliations. Now that RORs have been identified, we can calculate the organizational connectivity directly. Figure 2 shows the ROR connectivity after RORs were identified. The results show that 6% of the DOIs have RORs for all authors (complete connectivity), 8% have RORs for some authors (partial connectivity), and 85% have no RORs (missing connectivity). This is a significant improvement over the initial state in which no DOIs had RORs (100% missing).
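The complete, partial, and missing categories can be computed per DOI and then summarized across the repository. The following is a minimal sketch with assumed names and data: `classify`, `connectivity_summary`, the `creators` and `ror` keys, and the four toy DOIs are hypothetical and are not the actual UNAVCO metadata layout.

```python
from collections import Counter

def classify(creators, key="ror"):
    """Label one DOI by how many of its creators carry the identifier."""
    have = sum(1 for c in creators if c.get(key))
    if have == 0:
        return "missing"
    if have == len(creators):
        return "complete"
    return "partial"

def connectivity_summary(dois, key="ror"):
    """Percentage of DOIs at each connectivity level."""
    counts = Counter(classify(d["creators"], key) for d in dois)
    return {
        level: 100 * counts[level] / len(dois)
        for level in ("complete", "partial", "missing")
    }

ror = "https://ror.org/02n9tn974"  # UNAVCO, from Table 1
dois = [
    {"creators": [{"ror": ror}, {"ror": ror}]},   # all creators have RORs
    {"creators": [{"ror": ror}, {"ror": None}]},  # some creators have RORs
    {"creators": [{"ror": None}]},                # no RORs
    {"creators": [{}]},                           # no RORs
]
summary = connectivity_summary(dois)
print(summary)  # {'complete': 25.0, 'partial': 25.0, 'missing': 50.0}
```

Passing `key="orcid"` would give the same breakdown for people, assuming creators store ORCIDs under that key.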
Conclusions
Improving the number of identifiers of all kinds (connectivity) across repositories is important as the role of connections increases in data discovery, understanding, and re-use. Domain repositories can benefit from well-developed communities in making these improvements because community members, both individuals and organizations, make multiple contributions through time.
These benefits were demonstrated using the UNAVCO DataCite repository as an example. In this case, the three most common ORCIDs make up 50% of all ORCIDs in the repository. Similarly, as shown here, the three most common RORs make up almost 70% of the RORs in the repository.
Taking advantage of this characteristic and the spreading approach described above, the connectivity of the UNAVCO repository was increased using information already in the repository. The portion of DOIs with ORCIDs for all authors (complete connectivity) increased by a factor of three, from 2% to 6%, while the portion of DOIs with RORs for all authors increased from zero to 6% (see Table 2).
Table 2. Increased connectivity for people and organizations in the UNAVCO DataCite repository.
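Party | People (ORCIDs) | | | Organizations (RORs) | | |
Connectivity | Missing | Partial | Complete | Missing | Partial | Complete |
Initial | 93% | 4% | 2% | 100% | 0% | 0% |
Improved | 86% | 8% | 6% | 85% | 8% | 6% |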
Unfortunately, as in many other repositories, the portion of DOIs without ORCIDs or RORs remains very high. Fortunately, like many domain repositories, UNAVCO maintains a list of papers that have been written by community members using data from UNAVCO. Metadata for these papers is another source of information that can be brought to bear on the problem of increasing connectivity.