DataCite Bright Spots

Ted Habermann and Erin Robinson, Metadata Game Changers

Over the last decade, DataCite has grown into a critical part of the foundation of the global research infrastructure, holding Digital Object Identifiers (DOIs) and metadata for over 85 million research objects. The DataCite metadata schema has also evolved considerably (Habermann, 2021, 2025; Stathis et al., 2023), incorporating new structured content, identifiers, resource types, and relations. While these changes have created powerful new capabilities, adoption of the metadata elements that support them has been slow. In fact, the six original mandatory fields and the resource landing page still make up the bulk of the DataCite metadata.

Positive deviance (Marsh et al., 2004) is a problem-solving approach that identifies individuals or groups who are succeeding despite facing the same challenges as the rest of their communities. These Bright Spots serve as examples that the community can learn from to develop effective solutions. It's essentially about recognizing that communities often possess their own internal solutions, and that these solutions can be uncovered by studying those who have already overcome the difficulties. An important first step towards community adoption is creating metrics that can be used both to identify Bright Spots and to guide metadata creators and managers.

The DataCite Metadata Schema can address many use cases, each of which can have very different metadata requirements. An initial recommendation for FAIR DataCite Metadata was proposed as part of MetaDIG (Habermann, 2019). Since then, the recommendation has evolved into sets of documentation concepts that map to DataCite metadata elements and help address four FAIR use cases: FAIR Text, FAIR Identifiers, FAIR Connections, and FAIR Contacts. Together these four use cases include 61 metadata elements, seven of which are mandatory (Table 1).

| Use Case | Metadata Elements |
| --- | --- |
| FAIR Text (15) | **Resource Identifier**, **Resource Publication Date**, **Resource Publisher**, **Resource Title**, **Resource Author**, **Resource Type General**, Abstract, Keyword, Resource Author Affiliation, Keyword Vocabulary, Project Funder, Date Created, Spatial Extent, Award Title, Temporal Extent |
| FAIR Identifiers (18) | Resource Author Type, Resource Type, Resource Author Identifier Type, Resource Author Identifier, Resource Author Affiliation Identifier, Resource Author Affiliation Identifier Scheme URI, Resource Author Affiliation Identifier Type, Resource Identifier Type, Keyword Vocabulary URI, Publisher Identifier, Publisher Identifier Type, Publisher Identifier Scheme URI, Date Submitted, Funder Identifier Type, Award Number, Funder Identifier, Award URI, Keyword Value URI |
| FAIR Connections (18) | **Resource URL**, Rights, Resource Size, Resource Format, Date Available, Resource Contact, SupplementTo, CitedBy, Technical Information, Methods, ReferencedBy, DocumentedBy, SourceOf, Distribution Contact, DescribedBy, HasMetadata, RightsHolder, ReviewedBy |
| FAIR Contacts (10) | Rights URI, Resource Contact Identifier Scheme, Resource Contact Identifier, Resource Contact Identifier Scheme URI, Distribution Contact Identifier Scheme, Distribution Contact Identifier, Rights Holder Identifier Scheme, Distribution Contact Identifier Scheme URI, Rights Holder Identifier, Rights Holder Identifier Scheme URI |

Table 1. Documentation concepts that can be mapped to DataCite metadata elements (Habermann, 2024) and that support four FAIR use cases. The number of elements for each use case is given in parentheses; mandatory elements are bold.
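The text does not spell out how a completeness score is calculated, so the following is only a minimal sketch under stated assumptions: a record is represented as a dictionary keyed by the documentation concept names in Table 1, a concept counts as present when its value is non-empty, and the score for a use case is the mean, over its concepts, of the fraction of records that carry each concept. The function names and the sample records are hypothetical.

```python
# Documentation concepts for the FAIR Text use case, as listed in Table 1.
FAIR_TEXT = [
    "Resource Identifier", "Resource Publication Date", "Resource Publisher",
    "Resource Title", "Resource Author", "Resource Type General", "Abstract",
    "Keyword", "Resource Author Affiliation", "Keyword Vocabulary",
    "Project Funder", "Date Created", "Spatial Extent", "Award Title",
    "Temporal Extent",
]


def element_completeness(records, concept):
    """Fraction of records in which the concept has a non-empty value."""
    if not records:
        return 0.0
    present = sum(1 for record in records if record.get(concept))
    return present / len(records)


def use_case_completeness(records, concepts):
    """Mean of the per-concept completeness values (assumed definition)."""
    return sum(element_completeness(records, c) for c in concepts) / len(concepts)


# Two made-up records: both carry the first six FAIR Text concepts,
# and the second one also carries an abstract.
first_six = FAIR_TEXT[:6]
records = [
    {concept: "x" for concept in first_six},
    {**{concept: "x" for concept in first_six}, "Abstract": "An example abstract."},
]

score = use_case_completeness(records, FAIR_TEXT)
print(f"FAIR Text completeness: {score:.0%}")
```

In this toy case six concepts are present in every record and one in half of them, so the score is 6.5 out of 15. A real implementation would first have to map each concept to the corresponding DataCite element or attribute, which is not shown here.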

This recommendation differs from many others in that it goes significantly beyond a minimum metadata recommendation. Experience with minimum recommendations in many communities indicates that the metadata you get is the metadata you recommend, i.e., the recommendations can function as limits on the metadata that are produced. This recommendation instead covers four potential use cases across the entire FAIR spectrum. It is aspirational and serves the goal of using DataCite capabilities to the fullest extent possible.

A prior analysis of 387 university repositories identified the University of Bath as the most complete of them, and subsequent work at the University improved completeness for several elements. This repository provides an outstanding example both of complete metadata and of the continuous metadata improvement that we are trying to identify and encourage.

This work has been extended to cover 3066 currently active DataCite repositories. Figure 1 shows the Total FAIRness scores for these repositories as a function of repository size. As positive deviance predicts, some repositories have metadata that are much more complete than those of other repositories of similar size. These are the Bright Spots.

Figure 1. Total FAIRness completeness scores for 3066 DataCite repositories. Repositories in the green box are Bright Spots. The average completeness (23%) and the Bright Spot limit (two standard deviations above the mean, 38%) are shown. For repositories with more than 5000 records (13% of the repositories), random samples of 5000 records were taken.

Bright Spots can be identified in many ways, both visually and quantitatively. Quantitatively, the green box in Figure 1 shows repositories more than two standard deviations above the mean. Visually, there is a slight break in the data close to 49%; the repositories with scores at or above that break are listed in Table 2.
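The quantitative rule can be sketched in a few lines. This is an illustration rather than the code used for Figure 1: the repository ids and scores are made up, the function names are hypothetical, and the use of the population standard deviation (`pstdev`) rather than the sample standard deviation is an assumption.

```python
from statistics import mean, pstdev


def bright_spot_limit(scores, n_sd=2.0):
    """Mean plus n_sd standard deviations (population SD is an assumption)."""
    return mean(scores) + n_sd * pstdev(scores)


def find_bright_spots(scores_by_repo, n_sd=2.0):
    """Return the repositories whose total score lies above the limit."""
    limit = bright_spot_limit(list(scores_by_repo.values()), n_sd)
    return {repo: s for repo, s in scores_by_repo.items() if s > limit}


# Made-up total completeness scores: nine repositories at 20%, one at 60%.
scores_by_repo = {f"repo-{i}": 0.20 for i in range(9)}
scores_by_repo["repo-9"] = 0.60

limit = bright_spot_limit(list(scores_by_repo.values()))
spots = find_bright_spots(scores_by_repo)
print(f"limit = {limit:.0%}, bright spots = {sorted(spots)}")
```

With the values reported for Figure 1 (mean 23%, limit 38%), the same rule implies a standard deviation of about 7.5 percentage points.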

| Repository Name | ID | # Records | FAIR Text | FAIR Identifiers | FAIR Connections | FAIR Contacts | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| University of Bath | bl.bath | 773 | 79% | 79% | 41% | 37% | 61% |
| National Institute for Fusion Science | rpht.nifs | 5000 | 87% | 78% | 28% | 10% | 54% |
| RISIS_CNR | rpak.hpgbbj | 2 | 80% | 72% | 25% | 30% | 53% |
| StrainInfo (Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures) | zypi.vnulkh | 5000 | 80% | 83% | 22% | 10% | 52% |
| EarthEnv | ktsw.aezvvv | 1 | 80% | 61% | 33% | 30% | 52% |
| Grand Accélérateur National d’Ions Lourds | inist.ganil | 20 | 73% | 55% | 34% | 40% | 51% |
| South Australian Health and Medical Research Institute | sahmri.repo | 6 | 86% | 57% | 20% | 42% | 51% |
| Archaeological Map of the Czech Republic | arch.igsn | 5000 | 87% | 66% | 28% | 10% | 51% |
| Archaeological Map of the Czech Republic | arch.avanjj | 5000 | 80% | 67% | 28% | 10% | 49% |
| The Chandra Data Archive | si.cda | 5000 | 87% | 39% | 39% | 30% | 49% |

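Table 2. Metadata completeness for the four FAIR use cases and in total for repositories with total completeness of at least 49%. For repositories with more than 5000 records, # Records is the size of the random sample (5000).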
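The relationship between the use case scores and the Total column is not stated explicitly. One reading, offered here as an assumption, is that the total is the mean over all 61 elements, which is the same as weighting each use case score by its element count from Table 1 (15, 18, 18, and 10). The sketch below applies that reading to the rounded values in Table 2; the function and variable names are hypothetical.

```python
# Element counts per use case, from Table 1.
ELEMENT_COUNTS = {
    "FAIR Text": 15,
    "FAIR Identifiers": 18,
    "FAIR Connections": 18,
    "FAIR Contacts": 10,
}


def total_completeness(use_case_scores):
    """Element-count-weighted mean of the use case scores (assumed definition)."""
    total_elements = sum(ELEMENT_COUNTS.values())  # 61
    weighted = sum(ELEMENT_COUNTS[name] * score
                   for name, score in use_case_scores.items())
    return weighted / total_elements


# (repository id, Text, Identifiers, Connections, Contacts, Total) in percent,
# copied from Table 2.
TABLE_2 = [
    ("bl.bath", 79, 79, 41, 37, 61),
    ("rpht.nifs", 87, 78, 28, 10, 54),
    ("rpak.hpgbbj", 80, 72, 25, 30, 53),
    ("zypi.vnulkh", 80, 83, 22, 10, 52),
    ("ktsw.aezvvv", 80, 61, 33, 30, 52),
    ("inist.ganil", 73, 55, 34, 40, 51),
    ("sahmri.repo", 86, 57, 20, 42, 51),
    ("arch.igsn", 87, 66, 28, 10, 51),
    ("arch.avanjj", 80, 67, 28, 10, 49),
    ("si.cda", 87, 39, 39, 30, 49),
]

computed = {}
for repo_id, text, ids, conn, contacts, reported in TABLE_2:
    scores = dict(zip(ELEMENT_COUNTS, (text, ids, conn, contacts)))
    computed[repo_id] = round(total_completeness(scores))
    print(f"{repo_id}: computed {computed[repo_id]}%, reported {reported}%")
```

Under this assumption the computed totals match the reported Total column for all ten rows, for example (15 × 79 + 18 × 79 + 18 × 41 + 10 × 37) / 61 ≈ 61% for bl.bath.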

Together these repositories reflect the breadth of the DataCite community. The University of Bath Research Data Archive (bl.bath) is an institutional repository with fewer than 800 records, while the National Institute for Fusion Science (rpht.nifs) is the largest repository in DataCite, with over 19 million records. Four others are large repositories with well over 15,000 records. StrainInfo at the Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures (zypi.vnulkh) is part of the German National Research Infrastructure, which includes a diverse portfolio of bioresources for biodiversity researchers from academia and industry. The Archaeological Map of the Czech Republic has two large repositories, one with physical samples (arch.igsn) and one with reports and images (arch.avanjj). The Chandra Data Archive (si.cda) holds data collected by an X-ray telescope. There are also repositories with only a handful of records. They are starting out on the right foot!

Conclusions

The task of creating complete and consistent metadata for collections of diverse research resources demands commitment, perseverance, and hard work. These metrics make otherwise invisible repository improvements visible, and we celebrate them.

The breadth of the recommendation reflects the fact that there are many pathways to success in the DataCite community. The green box in Figure 1 shows 138 repositories that have total completeness two or more standard deviations above the mean (completeness > 38%). All of these Bright Spot repositories have overcome obstacles that we all experience and understand, and they demonstrate that those obstacles can be overcome. The range of use cases and applications suggests that there is no reason to make the Bright Spot Club exclusive: any repository can join! Congratulations to all. Good luck and keep up the great work.

Data Availability

The completeness results for all 3066 repositories are available in Zenodo. They include repository names and ids, the number of records, the completeness for the four use cases and the total, and the provider id.

References

Habermann, T. (2019). MetaDIG recommendations for FAIR DataCite metadata. https://doi.org/10.5438/2chg-b074

Habermann, T. (2020). Minimum Metadata. Front Matter. https://doi.org/10.59350/kanrj-qt678

Habermann, T. (2021). DataCite Metadata: Evolving to FAIRness. Front Matter. https://doi.org/10.59350/ftzmw-k9q02

Habermann, T. (2025). DataCite Metadata Continues to Add Capabilities. Front Matter. https://doi.org/10.59350/wgret-2kd02

Marsh, D., Schroeder, D. G., Dearden, K., Sternin, J., & Sternin, M. (2004). The Power of Positive Deviance. BMJ. https://doi.org/10.1136/bmj.329.7475.1177

Stathis, K., Chen, X., Cousijn, H., & Puebla, I. (2023, November 8). The DataCite Metadata Schema Through Time. A Decade of Data: Celebrating 10 Years of the Research Data Alliance, Online. Zenodo. https://doi.org/10.5281/zenodo.10081013