Metadata Life Cycle: Mountain or Superhighway?

Cite this blog as Habermann, T. (2022). Metadata Life Cycle: Mountain or Superhighway? Front Matter. https://doi.org/10.59350/86jd5-wpv70

This work could not have been done without Lisa Johnson, the Director of the Data Repository for the University of Minnesota (DRUM). Thanks Lisa!

Introduction

In the early days, repositories had metadata collections in one dialect that served, to some degree, all needs: discovery, identification, access, interoperability, and reuse. Native metadata were either hidden from users or served in their native representation, typically XML or JSON. Over the last several years, with the emergence of the global research infrastructure (DataCite, ORCID, ROR, Crossref / FundRef, …) and the increase in special-purpose metadata dialects (e.g. schema.org, STAC, IGSN, …), the landscape has changed considerably. In this landscape, metadata can follow many complex pathways, ending up in multiple services and repositories. Metadata can be gained along these paths, but, more typically, they are lost.

Figure 1 illustrates one of metadata’s journeys through the layers of “Metadata Mountain” in the DRUM Repository at the University of Minnesota. Metadata and documentation are created through active partnerships between researchers and the repository during the submission process. Some of the information is structured, i.e. the metadata, and some is unstructured, i.e. the documentation, and together they make up the first layer of Metadata Mountain, providing the most complete basis for understanding the data and reproducing scientific results based on it.

Figure 1. Layers of documentation and metadata in the DRUM repository.

The next step on the path up Metadata Mountain is extracting the structured metadata. In the DRUM case, these metadata are represented using Dublin Core and they become the most complete metadata (structured information) for the resource. A subset of these metadata are made available through an OAI-PMH feed (in Dublin Core). In the final step up Metadata Mountain, a smaller metadata subset is transferred to DataCite to receive a DOI for the dataset. For most datasets, this final subset includes only the mandatory DataCite elements: those required for citation (Identifier, Creator, Title, Publisher, Publication Year, and Resource Type) and for simple access (the landing page URL). This minimal package is not unique to DRUM; it is ubiquitous across most repositories in DataCite (Habermann, 2020).
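
To make the size of that final step concrete, here is a minimal sketch of a "get a DOI" request to the DataCite REST API containing only the mandatory citation elements plus the landing page URL. The DOI, repository credentials, and landing page shown are hypothetical, not actual DRUM values.

```python
import requests

# A minimal "get a DOI" payload: only the mandatory DataCite properties that
# support citation (identifier, creator, title, publisher, publication year,
# resource type) plus the landing-page URL. DOI, credentials, and landing
# page below are hypothetical examples.
payload = {
    "data": {
        "type": "dois",
        "attributes": {
            "doi": "10.13020/example-drum-doi",          # hypothetical DOI
            "creators": [{"name": "Researcher, A."}],
            "titles": [{"title": "Example DRUM dataset"}],
            "publisher": "Data Repository for the University of Minnesota (DRUM)",
            "publicationYear": 2022,
            "types": {"resourceTypeGeneral": "Dataset"},
            "url": "https://conservancy.umn.edu/handle/11299/example",  # hypothetical landing page
            "event": "publish",                          # make the DOI findable
        },
    }
}

response = requests.post(
    "https://api.datacite.org/dois",
    json=payload,
    auth=("REPOSITORY_ID", "PASSWORD"),                  # hypothetical credentials
    headers={"Content-Type": "application/vnd.api+json"},
)
response.raise_for_status()
print(response.json()["data"]["id"])
```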

Most of the metadata created in the second layer are no longer included at this point along the path to the top of Metadata Mountain. They remain in the original repository, but they are not available to the global research infrastructure of connections.

This metadata attrition on the climb up Metadata Mountain is not unique to DRUM. Figure 2 compares FAIR completeness scores in four categories from two sets of DataCite records: 1) those created and managed by five different Institutional Repositories (dashed lines) and 2) those created by researchers from the same institutions as part of the submission process into other repositories like Zenodo, Dryad, Dataverse, and others (solid lines). The data show that the metadata entered by researchers into other repositories are considerably more complete with respect to these FAIR recommendations than those entered by the Institutional Repositories. Those metadata come from the lower layers of Metadata Mountain. They exist in the Institutional Repositories, but they don't make it to the global infrastructure.

Figure 2. Comparison of FAIRness for DataCite metadata from institutional repositories (dashed lines) and from datasets submitted by institutional researchers to other repositories (solid lines).

The schematic depiction of metadata layers in Figure 1 and the data in Figure 2 clearly indicate that differences in metadata content across the layers of Metadata Mountain are related to transfer processes rather than to creation processes. When these transfers are improved, attrition along the metadata pathway can be decreased or eliminated, metadata content is maintained, and Metadata Mountain becomes the Metadata Superhighway shown in Figure 3, increasing the content and the potential benefits of the global research infrastructure.

Figure 3. Valuable metadata collected by institutional repositories need to be included in the global research infrastructure to achieve their potential. The full content needs to be carried through the entire life cycle.

Improving Completeness of DRUM Metadata in the Global Infrastructure

“Measurement is the first step that leads to control and, eventually, to improvement. If you can’t measure something, you can’t understand it. If you can’t understand it, … you can’t improve it.” H. James Harrington

As Harrington and others have noted, measuring something is the first step towards understanding and improving it. Tools for measuring “FAIRness” of DataCite metadata were described and demonstrated for metadata from ~150 DataCite repositories managed by the German Technical Information Library (TIB). The same approach can be used to quantify FAIRness in DRUM and, more importantly, to measure improvements as new metadata are transferred to DataCite.
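
As a simplified sketch of that measurement, the DataCite REST API can be used to retrieve a repository's records and count how often each metadata field is populated. The short field list below is a stand-in for the full set of fifty-seven evaluated elements, and the client id is hypothetical.

```python
import requests
from collections import Counter

# Stand-in for the full set of fifty-seven evaluated DataCite elements.
FIELDS = ["titles", "creators", "publisher", "publicationYear", "types",
          "descriptions", "subjects", "fundingReferences", "rightsList",
          "relatedIdentifiers", "dates", "contributors", "geoLocations"]

def fetch_records(client_id, pages=5, page_size=100):
    """Retrieve DataCite records for one repository via the REST API."""
    for page in range(1, pages + 1):
        r = requests.get("https://api.datacite.org/dois",
                         params={"client-id": client_id,
                                 "page[size]": page_size,
                                 "page[number]": page})
        r.raise_for_status()
        data = r.json()["data"]
        if not data:
            break
        yield from data

def completeness(records):
    """Fraction of records in which each field is populated (non-empty)."""
    counts, total = Counter(), 0
    for rec in records:
        attrs = rec["attributes"]
        total += 1
        for field in FIELDS:
            if attrs.get(field):
                counts[field] += 1
    return {field: counts[field] / total for field in FIELDS} if total else {}

scores = completeness(fetch_records("umn.drum"))   # hypothetical client id
for field, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{field:20s} {score:6.1%}")
```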

Figure 4 shows results of evaluating FAIRness of DRUM metadata in DataCite, i.e. associated with DataCite DOIs. The elements included in the evaluation are shown in four groups: FAIR Findable Essential, FAIR Findable Support, FAIR AIR Essential, and FAIR AIR Support, the same groups shown in Figure 2. The pattern seen here is very similar to that seen across all DataCite repositories: the mandatory elements are included in all records, and other elements (recommended and optional) are rare. This reflects the limited “get a DOI” use case that DataCite addresses for most of its members.

Figure 4. Completeness of fifty-seven DataCite metadata elements related to Findability and Access, Interoperability, and Reusability (AIR) in the DRUM repository.

Examining metadata in layer 2 of DRUM, we identified a number of elements that are populated across many DRUM records but are not populated in the DataCite metadata. These elements fall into all four categories, as shown in Table 1. In particular, findability in full-text searches is improved by adding abstracts and keywords to the records; funder search results are improved with funder names, identifiers, and award numbers; connections to papers and other resources are made with ReferencedBy; and re-use is supported with license information and full-text technical information.

| Category | Metadata Concepts |
| --- | --- |
| Findable Essential | Abstract, Funder, Funder Project Identifier (Award Number), Resource Type, Subject |
| Findable Support | Dates, Funder Identifier, Funder Identifier Type |
| AIR Essential | Dates, ReferencedBy, TechnicalInfo, Rights |
| AIR Support | Rights URI |

Table 1. Metadata concepts added to DataCite from DRUM.
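
Table 1 can be read as a crosswalk between the repository's Dublin Core fields and DataCite properties. The sketch below illustrates one way to express that crosswalk; the Dublin Core field names on the left are assumptions about the DRUM (DSpace) export, while the DataCite property and sub-field names follow the DataCite metadata schema.

```python
# Illustrative crosswalk from assumed DRUM Dublin Core fields to the DataCite
# properties in Table 1. Each DRUM value is wrapped with the fixed sub-fields
# shown and appended to the named DataCite list property.
DC_TO_DATACITE = {
    "dc.description.abstract":    ("descriptions",       {"descriptionType": "Abstract"}),
    "dc.description":             ("descriptions",       {"descriptionType": "TechnicalInfo"}),
    "dc.subject":                 ("subjects",           {}),
    "dc.date.available":          ("dates",              {"dateType": "Available"}),
    "dc.rights":                  ("rightsList",         {}),
    "dc.contributor.funder":      ("fundingReferences",  {}),
    "dc.relation.isreferencedby": ("relatedIdentifiers", {"relationType": "IsReferencedBy",
                                                          "relatedIdentifierType": "DOI"}),
}
```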

The content of these elements was extracted from the DRUM records, translated into the DataCite metadata schema, and used to update the DataCite metadata. Figure 5 shows the evaluation results for the updated records using the same criteria used with the original records. The completeness over all categories improved from 15% to 32%. The Figure shows visually striking improvements for many elements related to Findability in the upper left, but the increases in the other categories, from < 10% to > 20% in two cases, are also significant, particularly because these are categories that are generally more difficult to populate.
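
The update step itself can be done through the DataCite REST API. The sketch below shows, under the same assumptions as the earlier example, what adding the Table 1 content to an existing record might look like; the DOI, credentials, and example values are illustrative, and the attribute names follow the DataCite schema.

```python
import requests

# Sketch of the update step: add richer DRUM content to an existing DataCite
# record with a PUT to the REST API. DOI, credentials, and values are examples.
update = {
    "data": {
        "type": "dois",
        "attributes": {
            "descriptions": [
                {"description": "Abstract text from the DRUM record ...",
                 "descriptionType": "Abstract"},
                {"description": "File formats, software, and processing notes ...",
                 "descriptionType": "TechnicalInfo"},
            ],
            "subjects": [{"subject": "limnology"}, {"subject": "water quality"}],
            "fundingReferences": [{
                "funderName": "National Science Foundation",
                "funderIdentifier": "https://ror.org/021nxhr62",
                "funderIdentifierType": "ROR",
                "awardNumber": "2135874",
            }],
            "rightsList": [{
                "rights": "Creative Commons Attribution 4.0 International",
                "rightsUri": "https://creativecommons.org/licenses/by/4.0/",
            }],
            "relatedIdentifiers": [{
                "relatedIdentifier": "10.1000/example-paper",   # hypothetical DOI
                "relatedIdentifierType": "DOI",
                "relationType": "IsReferencedBy",
            }],
        },
    }
}

r = requests.put(
    "https://api.datacite.org/dois/10.13020/example-drum-doi",  # hypothetical DOI
    json=update,
    auth=("REPOSITORY_ID", "PASSWORD"),                         # hypothetical credentials
    headers={"Content-Type": "application/vnd.api+json"},
)
r.raise_for_status()
```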

Figure 5. Completeness of fifty-seven DataCite metadata elements related to Findability and Access, Interoperability, and Reusability (AIR) in the updated DataCite metadata for DRUM.

Conclusion

The most challenging steps along the path up Metadata Mountain occur during metadata creation, in partnerships between researchers, data managers, and curators who often have different goals on different time-scales: short-term goals of getting identifiers so results can be published and long-term goals related to trustworthy data that can be re-used. The metadata created during this phase enable all elements of FAIR and are, therefore, the most valuable asset in the system. Historically, the principal role of the global research infrastructure has been to provide persistent identifiers for resources, and only minimum metadata, i.e. the six elements that are complete in Figure 4, are required to get those identifiers. Other valuable metadata are created, but they don't make it to the summit of Metadata Mountain.

The data in Figure 2 demonstrate the impact that this common practice has on the completeness and FAIRness of metadata in the global research infrastructure, and they demonstrate that the process of metadata transfer to DataCite is an important choke point in populating and increasing the value of the global infrastructure. The impact of improving the transfer process is demonstrated in Figure 4 and Figure 5 – significant increases in metadata completeness and in support for all FAIR principles.

Clayton Christensen described the Resources – Processes – Values (RPV) framework for organizational capabilities and the challenges that these elements pose for organizational change. All three of these elements contribute to an organization's capability for innovation, and the difficulty of changing them increases from resources to values. The metadata transfer problem described here is a good example of the process part of this equation, and improving that process is an important step towards broader change.

Values are the way that people in organizations think and make decisions, large and small, about what they do at work. The idea that DataCite exists only to provide DOIs is deeply embedded in repository thought processes, and this idea, i.e. this value, needs to evolve. The research community needs to think about DataCite (and other elements of the global research infrastructure) as powerful resources for describing and connecting the myriad resources that make up the connected research world. We need to maximize the information that we add to this system in order to maximize the benefits we can get out of it.

Acknowledgments

This work was funded by the National Science Foundation (https://ror.org/021nxhr62) Award 2135874.