Can Repositories Improve Metadata?
The road to complete and consistent metadata can be long and arduous – digging through piles of metadata and other kinds of data to find small gems of information that can be added to metadata records, contacting recalcitrant researchers to fill in blanks, slowly building content across a collection… Does it really need to be that hard?
Over the last several years, the Dryad Data Repository carried out an ambitious metadata improvement project described by Lowenberg and Habermann (2019) and Gould and Lowenberg (2019). The repository needed to track data contributions by organization, i.e., the affiliations of the authors of submitted data. Unfortunately, the original Dryad metadata model, used for almost 30,000 datasets, did not include a field for author affiliation (see Habermann, 2019 for a description of the metadata model). The original Dryad was conceived as a repository for data associated with published articles, and the metadata for those articles already included author and affiliation information (along with methods and results), so the dataset metadata did not need to repeat it. This presented a difficult multi-step metadata archeology problem: the pile of metadata that needed to be searched for identifiers had to be created before it could be searched.
Solving this problem depended on metadata that we did have in many cases: the identifiers (DOIs) of the articles that the datasets supplemented. We used those DOIs to search Crossref and other sources (journal websites and article metadata) for author affiliation strings, then searched those strings for organization names, and finally searched ROR for identifiers associated with those organizations. There were many opportunities for this search to fail: datasets may not have associated articles, authors may not have affiliations, and organizations may not have RORs. Success meant 1) finding an associated paper, 2) finding an affiliation string for an author, 3) finding an organization name in that string, and 4) finding a ROR for the organization. Overall, we were able to find at least one ROR for 60% of the datasets and 55% of the authors.
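The four-step search can be sketched against today's public Crossref and ROR APIs. This is an illustrative helper, not Dryad's code (and the ROR affiliation matcher it uses postdates the original project); the endpoint paths and response fields follow the two APIs' public documentation, while the function names are mine.

```python
# Illustrative sketch (not Dryad's code) of the DOI -> affiliation -> ROR
# pipeline, written against the public Crossref and ROR v1 APIs.
from __future__ import annotations

import json
import urllib.request
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"
ROR_AFFILIATION = "https://api.ror.org/organizations?affiliation="

def affiliation_strings(crossref_work: dict) -> list[str]:
    """Collect every author affiliation string from a Crossref work record."""
    names = []
    for author in crossref_work.get("message", {}).get("author", []):
        for affiliation in author.get("affiliation", []):
            if affiliation.get("name"):
                names.append(affiliation["name"])
    return names

def chosen_ror(ror_response: dict) -> str | None:
    """Return the ROR ID the affiliation matcher flags as a confident match."""
    for item in ror_response.get("items", []):
        if item.get("chosen"):
            return item["organization"]["id"]
    return None

def rors_for_article(doi: str) -> set[str]:
    """Steps 1-4: article DOI -> affiliation strings -> ROR IDs (network calls)."""
    with urllib.request.urlopen(CROSSREF_WORKS + quote(doi)) as resp:
        work = json.load(resp)
    rors = set()
    for name in affiliation_strings(work):
        with urllib.request.urlopen(ROR_AFFILIATION + quote(name)) as resp:
            ror_id = chosen_ror(json.load(resp))
        if ror_id:
            rors.add(ror_id)
    return rors
```

Each step can come back empty, which is exactly where the pipeline loses datasets: no author list in the work record, no affiliation entries for an author, or no confident ROR match for a string.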
This work was done in the early days of the ROR project, i.e., before the Affiliation API was developed and deployed, so Dryad was a very early adopter of RORs. Since that time, many organizations have integrated organizational identifiers into their metadata models and then into their metadata. A year ago I identified leaders in ROR adoption at DataCite, and I revisited those numbers today (see Table 1), adding funder identifiers alongside creator and contributor affiliations. The earlier table included the top 15 clients in each category; this time it is only the top 10, as the default facet limit in the DataCite API has decreased.
| Client             | Creator | Contributor | Funder |
|--------------------|--------:|------------:|-------:|
| cern.zenodo        | 618,156 |             |        |
| tib.ipk            | 183,397 |     183,396 |        |
| pangaea.repository |  42,883 |             |        |
| datacite.topmed    |  42,379 |             |        |
| dryad.dryad        |  20,129 |             |  6,460 |
| ipk.gbis           |  16,725 |      16,726 |        |
| tib.fdmuh          |         |      23,700 |        |
| bl.cam             |         |      21,800 |        |
| bl.imperial        |  14,582 |             |  6,264 |
| imu.ub             |         |      16,387 |        |
| gesis.icpsr        |         |      15,713 |        |
| caltech.library    |   7,857 |             |  1,909 |
| si.cda             |         |       8,464 |        |
| heallink.tuc       |         |       8,321 |        |
| inist.ehess        |         |       5,789 |        |
| tib.ldeo           |         |       3,177 |        |
| bl.nerc            |   1,135 |             |    994 |
| tuw.repositum      |   1,443 |             |        |
| odu.viva           |         |             |  1,424 |
| dkrz.esgf          |         |             |    633 |
| carr.carr          |         |             |     68 |
| tib.gfzbib         |         |             |     47 |
| jcvi.gxpwaq        |         |             |     47 |
| bl.cefas           |         |             |     47 |

Table 1. Updated list of DataCite clients with the top 10 record counts for creator affiliation identifiers, contributor affiliation identifiers, and funder identifiers. A blank cell means the client is not in the top 10 for that category.
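The counts in Table 1 come from the client facet that the DataCite REST API returns alongside DOI query results. A minimal sketch, assuming the field paths shown in the example query (the facet parsing follows the documented response shape; the query string and helper names are mine):

```python
# Sketch of regenerating Table 1's counts from the DataCite REST API.
# The `clients` facet parsing matches the documented response shape;
# the query field path at the bottom is an assumption about indexing.
import json
import urllib.request
from urllib.parse import urlencode

API = "https://api.datacite.org/dois"

def client_counts(meta: dict) -> list:
    """(client id, record count) pairs from the `clients` facet of a response."""
    return [(c["id"], c["count"]) for c in meta.get("clients", [])]

def top_clients(query: str) -> list:
    """Run a DOI query and return the top clients by matching-record count."""
    url = f"{API}?{urlencode({'query': query, 'page[size]': 0})}"
    with urllib.request.urlopen(url) as resp:
        return client_counts(json.load(resp)["meta"])

# Assumed field path for records with ROR creator affiliation identifiers:
# top_clients("creators.affiliation.affiliationIdentifierScheme:ROR")
```

Because the facet arrives pre-truncated by the API's default limit, a query like this yields exactly the top-10 lists in the table, which is why the earlier top-15 view is no longer available by default.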