FAIR DataCite Metadata - University and College Bright Spots

Ted Habermann and Erin Robinson, Metadata Game Changers

Introduction

The FAIR Principles cover a broad spectrum of repository activities, from metadata content to repository practices and interactions with user communities, but they do not include much guidance on specific metadata elements. Principles F2 (data are described with rich metadata) and R1 (metadata have a plurality of accurate and relevant attributes) mention metadata specifically, but responsibility for identifying specific metadata elements that support FAIR data is left to community standards (Principle R1.3).

A community convention for FAIR DataCite metadata was proposed several years ago (Habermann, 2019). Table 1 shows that convention as groups of documentation concepts selected to support four use cases related to the FAIR Principles. Documentation concepts are independent concepts that can be mapped to many metadata dialects (Gordon and Habermann, 2018). The concepts in Table 1 map 1) to specific DataCite elements, e.g., title, 2) to relatedIdentifier.relationTypes, e.g., DocumentedBy, or 3) to contributorTypes, e.g., Distributor. See Habermann (2024d) for more details.
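As a rough illustration of what such a concept-to-dialect mapping looks like, the snippet below pairs a few documentation concepts with locations in the DataCite schema. The path notation and value names are our shorthand for this blog, not the published mapping in Habermann (2024d), so treat them as illustrative.

```python
# Illustrative shorthand for mapping documentation concepts to the DataCite
# schema: plain elements, relatedIdentifier relationTypes, and contributorTypes.
# These paths are a sketch for this blog, not the published mapping.
CONCEPT_TO_DATACITE = {
    # 1) specific DataCite elements
    "Resource Title": "titles/title",
    "Abstract":       "descriptions/description[descriptionType='Abstract']",
    "Keyword":        "subjects/subject",
    # 2) relatedIdentifier.relationType values
    "DocumentedBy":   "relatedIdentifiers/relatedIdentifier[relationType='IsDocumentedBy']",
    "CitedBy":        "relatedIdentifiers/relatedIdentifier[relationType='IsCitedBy']",
    # 3) contributorType values
    "Distribution Contact": "contributors/contributor[contributorType='Distributor']",
    "Resource Contact":     "contributors/contributor[contributorType='ContactPerson']",
}
```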

| Title | Code | Description | # of Concepts | Concepts |
|---|---|---|---|---|
| Text | FAIR_Findable_Support | Documentation concepts used to support many types of data discovery, i.e., text, spatial/temporal, author, publisher, etc. | 15 | Abstract, Date Created, Keyword, Keyword Vocabulary, Resource Author Affiliation, **Resource Author**, **Resource Identifier**, **Resource Publication Date**, **Resource Publisher**, **Resource Title**, **Resource Type General**, Project Funder, Award Title, Temporal Extent, Spatial Extent |
| Identifiers | FAIR_Findable_Essential | Documentation concepts that provide extra information, i.e., identifiers and references, about discovery concepts. | 15 | Date Submitted, Keyword Value URI, Keyword Vocabulary URI, Resource Author Type, Resource Author Identifier, Resource Author Identifier Type, Resource Author Affiliation Identifier, Resource Author Affiliation Identifier Type, Resource Author Affiliation Identifier Scheme URI, Resource Identifier Type, Resource Type, Funder Identifier, Funder Identifier Type, Award URI, Award Number |
| Connections | FAIR_AIR_Essential | Documentation concepts for dataset interoperability and for connections that support resource documentation, understanding, and trust. | 18 | CitedBy, Date Available, DescribedBy, Distribution Contact, DocumentedBy, Resource Contact, HasMetadata, Resource Format, Resource Size, **Resource URL**, Rights, RightsHolder, Methods, Technical Information, ReferencedBy, ReviewedBy, SourceOf, SupplementTo |
| Contacts | FAIR_AIR_Support | Documentation concepts for contacts that can answer questions not addressed in the metadata or other documentation. | 10 | Resource Contact Identifier, Resource Contact Identifier Scheme, Resource Contact Identifier Scheme URI, Distribution Contact Identifier, Distribution Contact Identifier Scheme, Distribution Contact Identifier Scheme URI, Rights Holder Identifier, Rights Holder Identifier Scheme, Rights Holder Identifier Scheme URI, Rights URI |

Table 1. Documentation concepts recommended for four FAIR metadata use cases.

These conventions alone don’t tell us how FAIR individual metadata records are or, when aggregated, how FAIR a particular repository is, but they provide a framework for measuring metadata completeness as a first step in metadata improvement (Gordon and Habermann, 2018). We used these conventions as a guide to measure FAIRness of DataCite metadata in over 1400 repositories (Burger et al., 2021; Habermann, 2021a) and to measure FAIRness for ~1.3 million DataCite metadata records collected during January 2024 from repositories associated with almost 400 universities and colleges worldwide. Habermann (2024a, b, c) showed that most of these repositories focus on providing the six mandatory DataCite elements required to get a DOI, a behavior consistent with the initial intent of the DataCite schema to support resource identification and citation. Beyond that initial intent, however, the DataCite schema includes many elements that support the broader FAIR goals (Access, Interoperability, and Reuse, i.e., AIR, Table 1).

This blog focuses on using those recent measurements to identify the repositories with the most complete metadata overall, i.e., bright spots in university metadata at DataCite. We measure and visualize metadata completeness for 58 metadata elements in four FAIR use cases, two of which correspond to Findability and two of which correspond to Accessibility, Interoperability, and Reuse (AIR in Table 1). The completeness visualization comprises four radar plots, one for each use case, with the names of the recommended documentation concepts around the outside of each plot. The completeness for each concept is 0% at the center of the plot and 100% along the outer edge.
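The sketch below illustrates how such per-concept completeness and radar panels can be produced. It is a minimal illustration, not the MetaDIG tooling used for this analysis: the presence checks, field names, and record structure are simplified assumptions based on DataCite-style JSON.

```python
import numpy as np
import matplotlib.pyplot as plt

def completeness(records, checks):
    """Fraction of records (0-1) in which each documentation concept is present.
    `checks` maps a concept name to a function that reports whether the concept
    is present in one metadata record (a dict)."""
    n = max(len(records), 1)
    return {concept: sum(bool(check(r)) for r in records) / n
            for concept, check in checks.items()}

def radar(ax, scores, title):
    """Draw one radar panel: 0% at the center, 100% at the outer edge."""
    labels = list(scores)
    values = [scores[label] for label in labels]
    angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
    values = values + values[:1]                      # close the polygon
    angles = np.concatenate([angles, angles[:1]])
    ax.plot(angles, values)
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels, fontsize=7)
    ax.set_ylim(0, 1)
    ax.set_title(title)

# A few simplified presence checks for the Findable Text use case.
findable_text_checks = {
    "Resource Title":  lambda r: bool(r.get("titles")),
    "Resource Author": lambda r: bool(r.get("creators")),
    "Abstract":        lambda r: any(d.get("descriptionType") == "Abstract"
                                     for d in r.get("descriptions", [])),
    "Keyword":         lambda r: bool(r.get("subjects")),
    "Project Funder":  lambda r: bool(r.get("fundingReferences")),
}

# Two toy DataCite-style records standing in for one repository's metadata.
records = [
    {"titles": [{"title": "Example dataset"}], "creators": [{"name": "Doe, J."}],
     "descriptions": [{"descriptionType": "Abstract", "description": "..."}],
     "subjects": [{"subject": "geochemistry"}]},
    {"titles": [{"title": "Second dataset"}], "creators": [{"name": "Roe, R."}]},
]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
radar(ax, completeness(records, findable_text_checks), "FAIR Findable Text")
plt.show()
```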

Minimum Metadata

Most DataCite repositories focus on the six mandatory elements and the landing page URL for the resource receiving a DOI. These elements occur in the Findable Text and AIR Connections use cases (bold in Table 1). They are present in essentially all repositories, so being able to recognize them in the completeness visualization is helpful.

Figure 1. Completeness visualization for four FAIR use cases and score for a repository that includes only the six mandatory elements and the resource URL.

Figure 1 shows the completeness visualization for a repository with only the mandatory elements. Six of these elements occur in the Findable Text visualization in the upper-left frame. Each of these elements is complete, i.e., present in 100% of the records, so the radar plot connects points along the outer edge of the plot. The Resource URL is required for accessibility, so it is included in the AIR Connections use case in the lower left. Because it is the only complete element in that plot, it appears as a single line from the center of the radar plot to the edge, like the Resource Author in the Findable Text use case.

These mandatory elements are ubiquitous across almost all repositories, so the scores shown in this figure (Findable Text = 40%, AIR Connections = 6%, Findable Identifiers = AIR Contacts = 0%) are baselines for repository completeness. Together they result in an overall completeness score of 12%.
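The arithmetic behind these baseline numbers can be reproduced directly, assuming each use-case score is the mean completeness of its concepts and the overall score is the mean over all 58 concepts; the exact weighting in the assessment tool may differ, so treat this as an illustrative check.

```python
# Baseline scores for metadata containing only the six mandatory elements and
# the Resource URL (Figure 1). Assumes each use-case score is the mean
# completeness of its concepts and the overall score averages all 58 concepts.
use_cases = {
    # use case: (total concepts, concepts complete in mandatory-only metadata)
    "Findable Text":        (15, 6),   # the six mandatory DataCite elements
    "Findable Identifiers": (15, 0),
    "AIR Connections":      (18, 1),   # the Resource URL
    "AIR Contacts":         (10, 0),
}

for name, (total, complete) in use_cases.items():
    print(f"{name}: {complete}/{total} = {complete / total:.0%}")

overall = (sum(c for _, c in use_cases.values())
           / sum(t for t, _ in use_cases.values()))
print(f"Overall: {overall:.0%}")   # 7/58, about 12%
```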

Average Metadata

It is also helpful to know the completeness of a typical repository when interpreting these data. Figure 2 shows the completeness visualization for the average repository, calculated over all 1.3 million records in the entire dataset. The average number of records per repository is 3332.

Figure 2. Average completeness over 387 repositories for four use cases.

The average repository includes the mandatory elements shown in Figure 1, along with several recommended and optional elements from the DataCite schema, which raise the score for the Findable Text use case to 51%. Abstract and Keywords, which map to description and subject in the DataCite schema, are the most common recommended elements in the Findable Text use case, occurring in ~65% and ~45% of the records, respectively. Affiliation strings for authors, near the bottom of the Findable Text plot, occur in ~30% of the records, and Funders occur in <10%.

The Findable Identifiers use case in the upper right indicates that many repositories take advantage of the free-text Resource Type element to provide more detail than the mandatory Resource Type General element in the Findable Text use case. It also shows that identifiers for authors, organizations, funders, and awards are generally rare in these metadata, all occurring in <20% of the records in the average repository. We expect these numbers to increase as the influence of the recent Public Access Memo (OSTP, 2022) reaches the universities and colleges that produce this DataCite metadata.

The AIR Connections and Contacts use cases remain almost completely empty in the average repository, indicating that few universities and colleges are taking advantage of the DataCite schema’s support for connecting datasets to articles, software, instruments, computational notebooks, and other research objects, or for providing data formats and other important interoperability information.

Repository Bright Spots

To find examples that others in the community can learn from and follow, termed bright spots by Heath and Heath (2010), we can compare overall completeness scores for all repositories. These scores are shown as a function of repository size (number of records) in Figure 3.

Figure 3. Overall completeness scores vs. repository size for 387 university and college repositories. The line shows the average score of 23%. Repositories plotted at 14,500 have more than 14,000 records.

While repository size varies considerably and completeness decreases slowly as the number of records increases, Figure 3 shows that examples with above-average metadata exist across the complete spectrum of sizes. We chose three bright spots to examine in detail; they vary in size and are among the most complete in this sample. They are identified in Figure 3 and listed in Table 2.

| Repository Name (DataCite Repository ID) | # Records | Score (%) |
|---|---|---|
| University of Bath (bl.bath) | 677 | 46 |
| University of Oxford (figshare.oxford) | 10,000+ | 33 |
| University of Cape Town (figshare.uct) | 10,000+ | 33 |

Table 2. Bright spots identified in Figure 3.

University of Bath

The University of Bath Research Data Library is a truly outstanding bright spot in this group, with 677 records and an overall score of 46%, twice the average overall score (Figure 3). The completeness visualization for this repository, shown in Figure 4, has many notable features.

Figure 4. Completeness visualization for the University of Bath bright spot.

One outstanding feature of this repository is the coverage for all aspects of funder metadata. In the Findable Text use case, Project Funder is included in 99% of the records and Award Title in 87%. In the Findable Identifiers use case, Award Number is included in 81% of the records and Funder Identifier in 92%.

The content of the funder metadata is also interesting. The most common funder is the Engineering and Physical Sciences Research Council (EPSRC), which occurs in 556 records; 147 other funders occur 695 times (many records have multiple funders). The inclusion of this funder metadata is remarkable and likely reflects significant effort across the entire university community.

Resource Author Affiliation is also very complete in this repository: affiliations appear in 97% of the records and represent almost 200 organizations. One of these, the University of Bath, accounts for 1596 of the 2061 affiliations, which is to be expected for the repository’s home university (Habermann, 2022). The other affiliations provide great opportunities to quantify collaboration between the University of Bath and other organizations.

This repository was also identified as a bright spot for the AIR Connections use case, and Figure 4 shows why. The repository includes extensive interoperability metadata (resource size and format), provenance metadata (methods), connections to other research objects (ReferencedBy), and over 1500 rights statements specific to datasets and individual figures.

As described above, the DataCite schema includes many elements that support all the FAIR Principles, so even in an exemplary FAIR repository like this one, there are opportunities for improvement. The quantitative approach demonstrated here helps identify these opportunities and, combined with examination of existing content, pick good candidates for short-term wins that can help metadata improvement efforts gain and maintain momentum.

One potential short-term win in this case is Resource Author Affiliation Identifiers, which are nearly invisible near the bottom of the Findable Identifiers use case. The repository includes the University of Bath as an affiliation almost 1600 times, but includes the university’s ROR (https://ror.org/002h8g185) only six times. As a bonus, 573 records have the University of Bath as Rights Holder, and the ROR could also be added as the Rights Holder Identifier for these records. Many of the other affiliations are simple, single organizations, so a tool like ROR Retriever (Habermann, 2022b) could be used to find RORs for these as well.
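For repositories that want to follow a similar path, affiliation strings can be matched to ROR identifiers programmatically. The sketch below uses the public ROR affiliation-matching API rather than the ROR Retriever tool itself; the endpoint and response fields reflect the ROR v1 API as we understand it, so verify them against the current ROR documentation before relying on them.

```python
import requests

def find_ror(affiliation: str) -> str | None:
    """Return a ROR ID for an affiliation string, but only when the ROR
    matching service marks one candidate as 'chosen' (high confidence)."""
    response = requests.get(
        "https://api.ror.org/organizations",
        params={"affiliation": affiliation},
        timeout=30,
    )
    response.raise_for_status()
    for item in response.json().get("items", []):
        if item.get("chosen"):
            return item["organization"]["id"]
    return None   # ambiguous or unmatched strings need human review

print(find_ror("University of Bath"))   # expected: https://ror.org/002h8g185
```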

Figshare, the University of Oxford and the University of Cape Town

The other two bright spots come from the group of large repositories shown with a record count of 14,500 near the right side of Figure 3. These repositories have more than 14,500 records, but a random sample of 10,000 records from each was considered in this work. They lead the large repositories and represent an interesting partnership between the academic and commercial sectors. The repository ids (figshare.oxford and figshare.uct) indicate that these repositories run on the Figshare repository platform. Figshare is a repository product of the company Digital Science that is used as a generalist repository by many institutions, in this case the University of Oxford and the University of Cape Town. The completeness visualizations for these two bright spots are shown in Figures 5 and 6.

Figure 5. Completeness visualization for the University of Oxford bright spot.

Figure 6. Completeness visualization for the University of Cape Town bright spot.

Generalist repositories typically focus on data discovery rather than domain-specific metadata, so it is not surprising that these two repositories are close to the top in the Findable Text and Identifiers use cases. They both do well with Abstracts and Keywords (over 90%), which are important elements for full-text discovery.

Keyword Vocabularies are also provided in ~40% of the Oxford records and over 90% of the Cape Town records. Most keywords in the Cape Town repository come from the Australian and New Zealand Standard Research Classification (ANZSRC), and both repositories include a significant number of keywords from the Field of Science vocabulary. The FAIR Principles highlight the importance of standard keyword vocabularies, and these two are heavily used across many repositories.

Both repositories have significant coverage for Resource Author Affiliations and for Resource Author Affiliation Identifiers, both RORs and GRIDs (Oxford > 60% and Cape Town > 90%). In both cases, the home institutions dominate these identifiers.

The Resource Author Affiliation Identifier data for these repositories illustrate a feature of the tool used in this analysis that encourages identifier metadata that includes Identifier Type and Identifier Scheme URI. All three of these elements are included in the recommendation, so a single complete identifier yields three “points” rather than just one. The same is true for all identifiers in the recommendation. This highlights the importance of identifiers of all kinds as an essential part of the foundation of the global research infrastructure.
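To make the three-points-per-identifier idea concrete, the fragment below shows a complete affiliation identifier and the three recommendation concepts it satisfies. The field names follow the DataCite schema’s affiliation attributes (affiliationIdentifier, affiliationIdentifierScheme, schemeURI), but the check itself is an illustrative simplification rather than code from the assessment tool.

```python
# One complete affiliation identifier (field names follow the DataCite schema's
# affiliation attributes; illustrative, not extracted from the assessment tool).
affiliation = {
    "name": "University of Bath",
    "affiliationIdentifier": "https://ror.org/002h8g185",
    "affiliationIdentifierScheme": "ROR",
    "schemeURI": "https://ror.org",
}

# The three recommendation concepts this single identifier satisfies.
concept_fields = {
    "Resource Author Affiliation Identifier":            "affiliationIdentifier",
    "Resource Author Affiliation Identifier Type":       "affiliationIdentifierScheme",
    "Resource Author Affiliation Identifier Scheme URI": "schemeURI",
}

points = sum(1 for field in concept_fields.values() if affiliation.get(field))
print(points)   # 3 -- a complete identifier counts toward three concepts at once
```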

Both repositories are almost complete for the Date Created element, which is different from the publication date. This element is rare in the average repository, occurring in less than 15% of the records. Its presence here suggests that it is collected as part of the Figshare submission process and included in the metadata even though it is not mandatory. The same is true for the affiliations discussed above. These two examples remind us that collecting metadata elements is an important first step towards including them in metadata records, and that submission systems play a critical role in that collection process. Figshare is to be recognized for collecting all these non-mandatory elements and passing them to DataCite.

As in the Bath case, these two bright spots also have clear opportunities for improvement, particularly in the two AIR use cases. Making connections to resources in a generalist repository is challenging because many connections emerge after the resource is published in the repository, so re-curation tools are required.

Discussion

A recent exploration of data sharing in several large universities (Johnston et al., 2024) identified many challenges with improving DataCite (and Crossref) metadata completeness and recommended that repositories “Implement data sharing practices and infrastructure decisions with the pipeline to the global metadata infrastructure in mind” with the goal of enhancing “the findability, interoperability, accessibility, and reusability (FAIRness) of research data”.

In this blog we have demonstrated an approach to measuring progress towards this goal that shares many characteristics with the idea of continuous improvement initially popularized by Deming (Wikipedia, 2024) and applied in many contexts. Figure 7 shows a schematic model of metadata improvement cycles with three snapshots suggested by the data presented above.

Figure 7. Schematic model of metadata improvement cycles.

The Mandatory snapshot corresponds to the initial creation of metadata records with only the mandatory elements required to mint a DOI, like the example shown in Figure 1. Metadata created with these elements has a completeness score of 12%. If a repository decides to improve the completeness of the metadata it shares, it identifies high-impact changes that match a high-priority use case (wins) and re-curates the repository content or improves sharing tools to get those elements into DataCite. If the use case is data findability, the wins might be elements in the average repository shown in Figure 2 or other elements recommended by the DataCite schema. After this content is added or shared, the completeness is re-measured and increases to ~25%. As other high-priority use cases emerge, the cycle is repeated. In some situations, one of the subsequent use cases might be funder metadata, and the wins would be the various funder-related elements. Once this content is added or shared, the completeness is measured again, and it increases to ~50%. After several improvement cycles, past improvements get integrated into repository practices, re-curation becomes curation, and submitted records become more complete (Habermann, 2021b).

Of course, this specific scenario is speculative, and the stories associated with each of the repositories in this dataset are different. Together the three bright spots demonstrate that repositories can follow many paths to more FAIR DataCite metadata. In some cases, they may focus on increased discovery with text, including abstracts and keywords; in others, on funder metadata; in others, on interoperability metadata or other forms of documentation (methods and connections to papers); in still others, on domain-specific metadata included in the DataCite schema, like spatial extent for earth science data. DataCite includes over 3000 repositories, and Figure 3 shows that a variety of good examples exist.

Conclusion

The bright spots identified here serve as inspiration for others as they consider how they may improve or share more of their existing metadata with DataCite. They also demonstrate the potential for significant return on investments resulting in more complete metadata. The bright spot repositories deserve hearty thanks for being leaders in our community and doing the hard work required to build and sustain repositories with metadata that span all the FAIR Principles.

A critical piece of any improvement cycle is visibility. In many cases, metadata improvements happen in the shadows. Metadata managers know these improvements are there, but the increased completeness and the return on investment they provide are invisible to the broader community. New developments, like the tools shared in this blog for measurement, or DataCite Commons and the PIDGraph for browsing and connecting, are making the rich metadata and connections visible and helping to demonstrate their value.

Visibility is just one step. It is important to keep in mind that, while these observations suggest repository behaviors in metadata creation and sharing, only repository managers and staff, i.e., the hands on the data, can illuminate the motivations behind these behaviors for others. Recognition of this good work may open the door for increased understanding of these motivations and effective incentives for more complete metadata.

We would like to support the repository community in demonstrating success and building sustained momentum, another important element of organizational change. If you are a repository manager who wants to work together on measuring and improving your DataCite metadata to go beyond required or average metadata, or if you have re-curated your metadata and want to be re-assessed, let us know! We will continue this series of bright spot blog posts exploring other repositories doing good work and setting good examples.

Acknowledgements

This blog post has been a long time coming and many people have contributed to this work over several years. We are grateful to our collaborators, including Matt Jones (DataONE and co-PI of MetaDIG, U.S. National Science Foundation Award 1443062), the DataCite team, especially Kelly and Mary, Jamaica Jones, and the RADS team. This blog spurred conversations with Figshare and the University of Bath, and we are grateful for their review and comments.

References

Burger, M., Cordts, A., & Habermann, T. (2021). Wie FAIR sind unsere Metadaten?: Eine Analyse der Metadaten in den Repositorien des TIB-DOI-Services. Bausteine Forschungsdatenmanagement, (3), 1–13. https://doi.org/10.17192/bfdm.2021.3.8351

Gordon, S., & Habermann, T. (2018). The influence of community recommendations on metadata completeness. In Ecological Informatics (Vol. 43, pp. 38–51). Elsevier BV. https://doi.org/10.1016/j.ecoinf.2017.09.005

Habermann, T. (2019). MetaDIG recommendations for FAIR DataCite metadata. Front Matter. https://doi.org/10.59350/n31gm-kg364

Habermann, T. (2021a). A PID Feast for Research – PIDapalooza 2021. Front Matter. https://doi.org/10.59350/ryh7v-rmy52

Habermann, T. (2021b). Can Communities Improve Metadata? Front Matter. https://doi.org/10.59350/vtt2s-jss23

Habermann, T. (2022). Universities@DataCite. Front Matter. https://doi.org/10.59350/sgzhr-3kk88

Habermann, T. (2022b). Need help searching for RORs? Try RORRetriever! Front Matter. https://doi.org/10.59350/4gxfz-4kb47.

Habermann, T. (2024a). Universities@DataCite Revisited. Webinar recording. https://www.youtube.com/watch?v=_jsqwK_pgeQ

Habermann, T. (2024b). FAIR Evaluations of University and College Repositories at DataCite Using MetaDIG Mappings for Four Use Cases [Data set]. https://doi.org/10.5438/2chg-b074

Habermann, T. (2024c). Making the Invisible Visible: Celebrating the Year of Open Science. Front Matter. https://doi.org/10.59350/77zs1-hz764

Habermann, T. (2024d). FAIR Metadata Concepts in DataCite Metadata Schema [Data set], Zenodo. https://doi.org/10.5281/zenodo.12168626

Heath, C., & Heath, D. (2010). Switch: How to Change Things When Change Is Hard. Broadway Books, New York.

Johnston, L. R., Hofelich Mohr, A., Herndon, J., Taylor, S., Carlson, J. R., Ge, L., et al. (2024). Seek and you may (not) find: A multi-institutional analysis of where research data are shared. PLoS ONE, 19(4), e0302426. https://doi.org/10.1371/journal.pone.0302426

OSTP (2022). The 2022 OSTP Public Access Memo. https://www.whitehouse.gov/wp-content/uploads/2022/08/08-2022-OSTP-Public-access-Memo.pdf