Cite this blog as Habermann, T. (2021). Can Repositories Improve Metadata? Front Matter. https://doi.org/10.59350/t6dna-wbp85

The road to complete and consistent metadata can be long and arduous – digging through piles of metadata and other kinds of data to find small gems of information that can be added to metadata records, contacting recalcitrant researchers to fill in blanks, slowly building content across a collection… Does it really need to be that hard?

Over the last several years, the Dryad Data Repository carried out an ambitious metadata improvement project, described by Lowenberg and Habermann (2019) and Gould and Lowenberg (2019). The repository needed to track data contributions by organization, i.e. by the affiliations of the authors of submitted data. Unfortunately, the original Dryad metadata model, used for almost 30,000 datasets, did not include a field for author affiliation (see Habermann, 2019 for a description of the metadata model). The original Dryad was conceived as a repository for data associated with published articles, and the metadata for those articles included author and affiliation information (along with the methods and results), so the dataset records did not need to repeat it. This presented a difficult multi-step metadata archeology problem: the pile of metadata that needed to be searched for identifiers had to be created before it could be searched.

Solving this problem depended on metadata that we did have in many cases: the identifiers of the articles that the datasets supplemented. We used those identifiers (DOIs) to search Crossref and other sources (journal websites and article metadata) for author affiliation strings, then searched those strings for organization names, and then searched ROR for identifiers associated with those organizations. There were many opportunities for this search to fail: datasets may not have associated articles, authors may not have affiliations, and organizations may not have RORs. Success meant 1) finding an associated paper, 2) finding an affiliation string for an author, 3) finding an organization name in that string, and 4) finding a ROR for the organization. Overall, we were able to find at least one ROR for 60% of the datasets and 55% of the authors.
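Steps 2 through 4 of that search can be sketched in a few lines of Python. This is only an illustration, not the code used for Dryad: it assumes a Crossref-style work record in which each author carries a list of affiliation objects with a "name" field, and it replaces the ROR search with a small hypothetical lookup table (ROR_BY_NAME). The author names and the affiliation string in the sample record are made up.

```python
from typing import Optional

# Hypothetical lookup table of organization names to ROR identifiers.
# A real project would search the ROR registry instead.
ROR_BY_NAME = {
    "Leibniz Institute of Plant Genetics and Crop Plant Research":
        "https://ror.org/02skbsp27",
}


def affiliation_strings(work: dict) -> list[str]:
    """Step 2: collect affiliation strings from a Crossref-style work record."""
    found = []
    for author in work.get("author", []):
        for aff in author.get("affiliation", []):
            name = aff.get("name", "").strip()
            if name:
                found.append(name)
    return found


def find_ror(affiliation: str, lookup: dict = ROR_BY_NAME) -> Optional[str]:
    """Steps 3 and 4: find a known organization name in the string, return its ROR."""
    lowered = affiliation.lower()
    for name, ror in lookup.items():
        if name.lower() in lowered:
            return ror
    return None


def rors_for_work(work: dict) -> set:
    """Return every ROR that can be found for the authors of one article."""
    return {ror for s in affiliation_strings(work) if (ror := find_ror(s))}


# Made-up sample record: one author with an affiliation, one without.
sample = {
    "author": [
        {"family": "Doe", "affiliation": [{"name": (
            "Leibniz Institute of Plant Genetics and Crop Plant Research, "
            "Gatersleben, Germany")}]},
        {"family": "Roe", "affiliation": []},
    ]
}
print(rors_for_work(sample))
```

Each function mirrors one place where the search can come up empty: a record with no authors, an author with no affiliation, or an affiliation string that matches no known organization.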

This work was done in the early days of the ROR project, i.e. before the Affiliation API was developed and deployed, so Dryad was a very early adopter of RORs. Since that time, many organizations have integrated organizational identifiers into their metadata models and then into their metadata. A year ago I identified leaders in ROR adoption at DataCite and I revisited those numbers today (see Table 1) with funder identifiers as well as creators and contributors. The earlier table included the top 15 in each category. This time it is only 10 as the default facet limit has decreased in the DataCite API.

Client | Record counts (source columns: Creator, Contributor, Funder; blank cells were dropped, so column positions are not shown)
cern.zenodo | 618,156
tib.ipk | 183,397 | 183,396
pangaea.repository | 42,883
datacite.topmed | 42,379
dryad.dryad | 20,129 | 6,460
ipk.gbis | 16,725 | 16,726
tib.fdmuh | 23,700
bl.cam | 21,800
bl.imperial | 14,582 | 6,264
imu.ub | 16,387
gesis.icpsr | 15,713
caltech.library | 7,857 | 1,909
si.cda | 8,464
heallink.tuc | 8,321
inist.ehess | 5,789
tib.ldeo | 3,177
bl.nerc | 1,135 | 994
tuw.repositum | 1,443
odu.viva | 1,424
dkrz.esgf | 633
carr.carr
tib.gfzbib
jcvi.gxpwaq
bl.cefas

Table 1. Updated list of DataCite clients in the top 10 by number of creator affiliation identifiers, contributor affiliation identifiers, or funder identifiers.

I am happy to see that Dryad is still comfortably in the top two for creator affiliation identifiers with 32,650. The new leader is tib.ipk, The Leibniz Institute of Plant Genetics and Crop Plant Research (https://ror.org/02skbsp27) with, incredibly, over 183,000 creator identifiers. How can this be?

It turns out that many DataCite client metadata collections are very homogeneous in many ways, including the affiliations of creators, contributors, and funders. This makes sense as many of the clients are associated with single institutions. During PIDapalooza last year I introduced the Affiliation Homogeneity Index (AHI) as a tool for predicting the ease of ROR adoption in client metadata collections. The index is calculated by dividing the count of the most common affiliation in DataCite client metadata by the total number of affiliations in the metadata (Figure 1):

Figure 1. The Affiliation Homogeneity Index (AHI) formula. If a set of metadata only has one affiliation, AHI = 100%.

If client metadata has only one affiliation, the count of the most common affiliation equals the total number of affiliations and the AHI = 100%. This means that the client only needs one ROR to adopt ROR across all of its metadata. In the tib.ipk case described above, that ROR is https://ror.org/02skbsp27 and it occurs over 183,000 times in the metadata.
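The calculation is simple enough to sketch in Python. The function name and the sample affiliation strings below are made up for illustration, and the sketch assumes that "total number of affiliations" means the total number of affiliation occurrences, which is the reading under which a single-affiliation collection scores 100%.

```python
from collections import Counter


def affiliation_homogeneity_index(affiliations):
    """Percentage of affiliation occurrences taken by the most common affiliation."""
    counts = Counter(affiliations)
    if not counts:
        raise ValueError("no affiliations to measure")
    most_common_count = counts.most_common(1)[0][1]
    return 100.0 * most_common_count / sum(counts.values())


# A collection where every record has the same affiliation.
single = ["Institute A"] * 5
# A collection where one affiliation dominates but is not the only one.
mixed = ["Institute A", "Institute A", "Institute A", "University B"]

print(affiliation_homogeneity_index(single))  # 100.0
print(affiliation_homogeneity_index(mixed))   # 75.0
```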

The good news is that many of the DataCite repositories have an AHI of 100%, i.e. they only need one ROR for complete adoption. In fact, seven of the top 10 in Table 1 have an AHI of 100%. Dryad has the lowest AHI of the group at 15%, confirming the experience described above, i.e. it was a challenge to find the thousands of RORs needed for Dryad.

DataCite added an affiliation facet to their API responses roughly two years ago. This allows us to retrieve the number and occurrence frequencies of affiliations in all client metadata. I used that capability to retrieve affiliation statistics for 243 clients with affiliations and calculated the Affiliation Homogeneity Index for all of them (for the 53 clients (21%) with more than ten affiliations this is actually a maximum estimate, since the facet returns only the ten most common affiliations and the total is therefore undercounted). Figure 2 shows the distribution of these numbers. Once again, the number of clients with only one affiliation (AHI = 100%) dwarfs all others. We know that some of these have already connected their organizations to the PID graph by assigning RORs and other identifiers. The others are low-hanging fruit in the affiliation identifier orchard!
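To see why a truncated facet gives a maximum estimate, here is a small Python sketch that computes the index from facet-style counts. The shape of the facet entries (dictionaries with a "count" key), the function name, and the numbers are assumptions for illustration, not the actual DataCite response or the script used for Figure 2.

```python
FACET_LIMIT = 10  # facet size discussed in this post; treated as an assumption


def ahi_from_facet(facet, facet_limit=FACET_LIMIT):
    """Return (AHI in percent, may_be_upper_bound) from facet-style entries.

    If the facet is full, less common affiliations may be missing from the
    total, so the computed AHI can only be too high, never too low.
    """
    counts = [entry["count"] for entry in facet]
    if not counts:
        raise ValueError("empty facet")
    ahi = 100.0 * max(counts) / sum(counts)
    return ahi, len(counts) >= facet_limit


# Made-up client with a single affiliation: the value is exact.
one_org = [{"title": "Institute A", "count": 500}]

# Made-up client whose facet is full: ten entries, possibly more unseen.
full_facet = [{"title": "Org 0", "count": 60}] + [
    {"title": f"Org {i}", "count": 10} for i in range(1, 10)
]

print(ahi_from_facet(one_org))     # (100.0, False)
print(ahi_from_facet(full_facet))  # (40.0, True)
```

In the second case any affiliations beyond the ten returned would enlarge the denominator, so 40% is the highest value the true index could take.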

Figure 2. The percentage of DataCite clients with Affiliation Homogeneity Index values in each range. The huge peak at 100% shows that 43% of clients with affiliations in their metadata only have one.

We started with two questions: “Can repositories improve metadata?” and “Does metadata improvement need to be so hard?”. Adding persistent and unique identifiers (of any kind) is one of the most valuable improvements that can be made to existing metadata. The Dryad case demonstrates that repositories can add identifiers to metadata even when it is hard. tib.ipk and many of the DataCite clients listed in Table 1 found that adding identifiers does not have to be so hard if, for example, your metadata describes resources from a small number of organizations. A look at DataCite metadata indicates that adding organizational identifiers to many DataCite client metadata collections involves looking up or knowing only a very small number of identifiers, and that most of the identifier associations are very clear, i.e. very safe assertions. If you would like to explore adding identifiers or making other improvements to your DataCite metadata, please contact me (ted@metadatagamechangers.com).
