Improving Domain Repository Connectivity: Establishing a Baseline

Cite this blog as Habermann, T. (2021). Improving Domain Repository Connectivity: Establishing a Baseline. Front Matter. https://doi.org/10.59350/kft70-xgz24

As I have been working with domain repositories to understand and describe their practices and apply for Core Trust Seal certification, I have been struck by the close, long-term relationships that these repositories form with their communities. In some cases, like UNAVCO, the repository is an integral part of an extensive community support system that extends from proposal planning and writing, through project initiation and implementation, data collection, management, and archive, to publication of results and access to data by other community members. Scientists, engineers, logistics specialists, data managers, software developers, and educators work together to create and extend our understanding of the shape of the earth and how it changes (the science of Geodesy). 

The UNAVCO Community described the responsibilities of players in open science communities during 2012 (https://doi.org/10.1029/2012EO260006) and developed an open data policy based on those responsibilities. These responsibilities included identifying datasets with PIDs and connecting data to papers with citations, that is, establishing an important element of the PID Graph: connections between papers and data.

I introduced the concept of Connectivity last month and have been thinking about it ever since. Connectivity measures how well research objects or collections of research objects are connected to the global research web, represented by the PID Graph. These connections depend on identifiers for all kinds of research objects. I am initially focusing on people, identified by ORCIDs, and organizations, identified by RORs.

As the breadth of identifiers and connections continues to expand, I made the leap from the strong connections between real people and organizations in the UNAVCO Community to connections between these entities in the PID Graph. Specifically, I wondered if the multitudinous real-world connections could help populate identifiers in the metadata and related connections in the PID Graph. I begin the exploration of this question here with UNAVCO datasets described in DataCite.

UNAVCO Datasets in DataCite

UNAVCO has minted over 5000 dataset DOIs with DataCite since 2013 (Figure 1). UNAVCO maintains an archive of these datasets with extensive metadata for discovery, access, and understanding, so the primary role of the DataCite repository is minting DOIs for identification and citation of the datasets.

Figure 1. The number of UNAVCO datasets registered in DataCite per year.

DataCite is also the place where the UNAVCO Community connects their data to the broader scientific world through identifiers included in the metadata. As mentioned above, I am particularly interested in connectivity through ORCIDs and RORs. Characterizing the current state of these identifiers in the collection is the first step in understanding the collection and measuring improvements in the connectivity that might be achieved through time.

Visualizing Connectivity

Our goal is to understand how to improve connectivity in domain repositories and to use connectivity as a metric for measuring progress as connectivity improves. In order to do this, we must be able to express connectivity as numbers and pictures. I do this using a horizontal bar that represents the entire collection, coloring sections of the bar green for items that have complete connectivity (on the left), yellow for items that have partial connectivity (in the middle), and red for items that have no connectivity (on the right; Figure 2).

Figure 2. Visualizing connectivity as the % of items with all identifiers (green, left), with some identifiers (yellow, middle), and with no identifiers (red, right).

The desired end state for connectivity is maximizing the % of the collection that has complete connectivity, so improvements make the complete part of the bar larger and the partial and missing parts of the bar smaller, illustrated in Figure 2 by the change between the lower and upper bars.
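As a rough illustration of how the three sections of such a bar could be derived from counts, here is a minimal Python sketch. The function names, the letters G/Y/R standing in for the colors, and the counts in the usage lines are all hypothetical and chosen only for illustration; this is not the code used to draw the figures in this post.

```python
def connectivity_shares(complete, partial, missing):
    """Return the % of items with complete, partial, and no connectivity."""
    total = complete + partial + missing
    if total == 0:
        raise ValueError("collection is empty")
    return tuple(100.0 * n / total for n in (complete, partial, missing))


def text_bar(complete, partial, missing, width=50):
    """Draw the bar as text: G (green, left), Y (yellow, middle), R (red, right).

    Segment widths are rounded down, and any leftover characters are
    given to the red (missing) segment so the bar is always `width` long.
    """
    pct_complete, pct_partial, _ = connectivity_shares(complete, partial, missing)
    n_complete = int(width * pct_complete / 100)
    n_partial = int(width * pct_partial / 100)
    n_missing = width - n_complete - n_partial
    return "G" * n_complete + "Y" * n_partial + "R" * n_missing


# Hypothetical counts, for illustration only
print(connectivity_shares(20, 30, 50))  # (20.0, 30.0, 50.0)
print(text_bar(20, 30, 50, width=10))   # GGYYYRRRRR
```

Improvement over time then shows up as the G segment growing while the Y and R segments shrink.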

ORCID Connectivity Baseline

Connectivity can be measured for many kinds of identifiers and for any collection of research objects or other entities in the PID Graph. For a single paper, ORCID connectivity is the % of authors that have ORCIDs (see Connectivity). This calculation can be easily extended to a collection with the ORCID connectivity being the % of authors in all collection items that have ORCIDs.
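The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each record is represented as a plain list of creator dictionaries with an optional "orcid" key, which is a simplified, hypothetical structure rather than the actual DataCite metadata schema, and the ORCID strings are made-up placeholders.

```python
def item_connectivity(creators):
    """Fraction of creators on one item that have an ORCID (None if no creators)."""
    if not creators:
        return None
    with_orcid = sum(1 for c in creators if c.get("orcid"))
    return with_orcid / len(creators)


def classify(creators):
    """Label an item as 'complete', 'partial', or 'none'."""
    value = item_connectivity(creators)
    if value is None or value == 0:
        return "none"
    return "complete" if value == 1 else "partial"


def collection_connectivity(records):
    """Fraction of creators across all items in the collection that have ORCIDs."""
    total = sum(len(creators) for creators in records)
    if total == 0:
        return None
    with_orcid = sum(1 for creators in records for c in creators if c.get("orcid"))
    return with_orcid / total


# Three hypothetical records with placeholder ORCIDs
records = [
    [{"name": "Author A", "orcid": "0000-0000-0000-0001"}],
    [{"name": "Author A", "orcid": "0000-0000-0000-0001"}, {"name": "Author B"}],
    [{"name": "Author C"}],
]
print([classify(r) for r in records])    # ['complete', 'partial', 'none']
print(collection_connectivity(records))  # 0.5
```

Counting the labels returned by classify across a collection gives the three numbers needed for the connectivity bar.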

The baseline ORCID connectivity for the UNAVCO DataCite collection is shown in Figure 3. Most datasets in the collection have no ORCIDs (93%, 5005/5356 DOIs), while 234 (4%) have some ORCIDs and 117 (2%) have all ORCIDs.

Figure 3. Initial (baseline) connectivity for ORCIDs in UNAVCO DataCite metadata.

The lack of ORCIDs in the UNAVCO metadata is not unusual. An assessment of 144 DataCite repositories in the TIB Consortium showed that, on average, less than 15% of the records in these repositories have identifiers. A similar assessment of all Crossref metadata during 2019 showed that the average portion of Crossref records with ORCIDs was less than 10%, and it was only during mid-2020 that the average number of ORCIDs per article in Crossref passed 2.0. We are clearly at the beginning of the ORCID adoption process across the scientific publishing world and we have a lot of room for increased adoption.

On the positive side, fifty-three authors have ORCIDs in this metadata that occur in a total of 499 datasets and just over 350 datasets have complete or partial ORCID coverage. The most common ORCID belongs to Marianne Okal, an engineer at UNAVCO. Her ORCID occurs 130 times in these data. Seven other ORCIDs occur more than ten times.

Affiliation Connectivity Baseline

The UNAVCO DataCite metadata do not currently include any organizational identifiers but the metadata do include affiliation names that can give us an idea of the maximum organizational connectivity that we can achieve if we can find identifiers for all of the affiliations.

Figure 4 shows the initial affiliation connectivity for UNAVCO metadata at DataCite which is very similar to the data for ORCIDs in Figure 3. In fact, the numbers are slightly better, with 382 records having complete or partial connectivity. Unfortunately, 93% of the records are still missing affiliation information.

Figure 5. Average connectivity for ORCIDs (orange) and affiliations (blue) per year.

Author Connectivity

The observations so far have focused on connectivity for UNAVCO datasets represented by DOIs. As mentioned earlier, connectivity can be calculated for any entity in the PID Graph. The UNAVCO DataCite metadata provide an opportunity to calculate ORCID connectivity for authors in the metadata that have ORCIDs. In this context, authors with complete connectivity have associated ORCIDs in all metadata records where they appear, i.e. they are completely connected. Authors with partial connectivity have ORCIDs only in some of the records where they appear. The records that include these authors but do not have ORCIDs provide an easy and completely safe opportunity to improve connectivity in the metadata by adding known ORCIDs for these authors in records currently missing them.

For example, we know that Marianne Okal, an engineer at UNAVCO, is the author with the most common ORCID in these data; it occurs in 130 datasets. She is an author on five other datasets that do not include her ORCID in the metadata, so she is an author 135 times and has 130 ORCIDs, and her connectivity is 130/135 = 96%. We can increase the number of records with ORCIDs by five by adding her ORCID to the five records that are currently missing it. This also increases her connectivity to 100%, i.e., complete.
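The author-level calculation and the spreading of known ORCIDs can be sketched as follows. This is a minimal illustration, not the code behind the results in this post. It assumes the same simplified, hypothetical record structure (a list of creator dictionaries per record) and, importantly, it assumes that the same author always appears under exactly the same name string; real metadata would need more careful name matching. The author name and ORCID in the usage lines are placeholders.

```python
from collections import defaultdict


def author_connectivity(records):
    """For each author with at least one ORCID, return ORCIDs / appearances."""
    appearances = defaultdict(int)
    with_orcid = defaultdict(int)
    for creators in records:
        for c in creators:
            appearances[c["name"]] += 1
            if c.get("orcid"):
                with_orcid[c["name"]] += 1
    return {name: n / appearances[name] for name, n in with_orcid.items()}


def spread_known_orcids(records):
    """Copy the records, adding a known ORCID wherever an author lacks one.

    Assumption: authors are matched by exact name string.
    Returns the new records and the number of ORCIDs that were added.
    """
    known = {}
    for creators in records:
        for c in creators:
            if c.get("orcid"):
                known.setdefault(c["name"], c["orcid"])
    added = 0
    new_records = []
    for creators in records:
        new_creators = []
        for c in creators:
            c = dict(c)  # work on a copy so the input is left unchanged
            if not c.get("orcid") and c["name"] in known:
                c["orcid"] = known[c["name"]]
                added += 1
            new_creators.append(c)
        new_records.append(new_creators)
    return new_records, added


# Placeholder author: 130 records with an ORCID, 5 records without
records = [[{"name": "Author A", "orcid": "0000-0000-0000-0001"}] for _ in range(130)]
records += [[{"name": "Author A"}] for _ in range(5)]

print(round(100 * author_connectivity(records)["Author A"]))  # 96
filled, added = spread_known_orcids(records)
print(added)                                    # 5
print(author_connectivity(filled)["Author A"])  # 1.0
```

Authors who never appear with an ORCID are left untouched, which is what keeps this step limited to assertions already present in the metadata.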

Figure 6 shows that 26% of the authors with known ORCIDs have partial connectivity. The total number of datasets authored by these authors without ORCIDs is 182. Adding these to the 499 records with ORCIDs increases the number of ORCIDs in the metadata by 36%.

Figure 6. Author connectivity in UNAVCO DataCite metadata

Figure 7 compares the ORCID connectivity before and after the addition of known ORCIDs. As expected on the basis of the discussion above, there is a significant improvement. The partial and complete DOIs now make up 14% of the collection as compared to 6% in the initial baseline and the number of records missing ORCIDs decreased by 8%. As mentioned earlier, this improvement was achieved with the completely safe assertion that ORCIDs for authors do not change.

Figure 7. Improvement in ORCID connectivity associated with spreading known ORCIDs to metadata without ORCIDs.

Conclusion

UNAVCO is a domain repository with very strong real-world connections to the Geodetic community in the United States and across the world. In order for these connections to be reflected in the PID Graph, UNAVCO datasets and the papers that use them must have unique and persistent identifiers and metadata for those research objects must include identifiers for people, organizations, and other related research entities.

The UNAVCO DataCite metadata currently include some or all ORCIDs for 6% of the datasets, reflecting only a small portion of the existing connections. This blog post describes a quantitative measure of repository connectivity and uses that metric to demonstrate a significant increase in connectivity accomplished by including known ORCIDs across all records. This first step suggests that connectivity can be increased using existing community information resources. Further improvements will be described in future blogs.