DataCite Metadata: Evolving to FAIRness
Cite this blog as Habermann, T. (2021). DataCite Metadata: Evolving to FAIRness. Front Matter. https://doi.org/10.59350/ftzmw-k9q02
Metadata schema evolution reflects the progression of needs, ideas, and practices of the community that creates and uses the metadata. Version 2.0 of the DataCite Metadata Schema was released ten years ago. During this time, the schema has evolved considerably, adding capabilities and supporting many new use cases. Looking back over the last ten years helps us understand the progress DataCite has made based on input from the community and the work of the DataCite Metadata Working Group and the DataCite organization.
During the last year I have been a member of the DataCite Metadata Working Group, witnessing the schema evolution process. All of the schema updates described here are included in the appendices of the DataCite schema documentation, which remains the authoritative source. The changes I summarize here are focused on improving capabilities related to FAIR use cases, mostly identifying various kinds of resources and making connections between them. I hope that looking at ten years of changes together increases awareness of the many steps that DataCite has taken to help data and metadata be more FAIR, and that this awareness encourages the community of metadata creators and managers to take advantage of these improvements.
Version 2.0 – January 2011
Version 2.0 of the schema provided a broad foundation of “core metadata properties chosen for the accurate and consistent identification of data for citation and retrieval purposes”. This focus on citation and discovery continues today, and the five mandatory elements defined ten years ago remain the core of DataCite metadata (see Habermann, 2019 and Habermann, 2020). The need for evolution was clearly identified during 2011: “it is highly desirable that the schema … be able to adapt to the full extent of academic and research use cases”, along with the important goal that the evolution be driven by real use cases.
In addition to setting a foundation for citation and discovery elements, Version 2.0 also defined the kinds of research objects described in the DataCite metadata repository (resourceTypeGeneral) and the types of relationships that could be defined among objects (relationType). These initial property values, shown in Table 1, started to define the outlines of what has since become the connected web of research objects, i.e. the PID Graph. I record them here because their evolution during the last ten years is an important part of the DataCite metadata evolution.
Metadata Property | Accepted Values (Vocabulary)
contributorType | ContactPerson, DataCollector, DataManager, Editor, HostingInstitution, ProjectLeader, ProjectMember, RegistrationAgency, RegistrationAuthority, Researcher
resourceTypeGeneral |
relationType |
Version 2.1 – March 2011
Version 2.1 of the schema followed closely and reflected mostly evolution of the back-end technology that supported the DataCite repository. A namespace was added to the schema to support validation and OAI PMH compatibility, enforcement was implemented for mandatory elements, and the format of the date property was extended to include months and days.
This date format change reflects small evolutionary changes in the DataCite community. First, the community needed more temporal resolution for specifying events in the provenance of resources, and second, DataCite responded to that need using a well-established date format published by the W3C (W3CDTF, a profile of the ISO 8601 International Standard). Connecting a community to broader standards is an important role for a large-scale repository.
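The added temporal resolution can be illustrated with a minimal Python sketch. It assumes only three forms are of interest (year, year and month, and full date) and ignores time-of-day forms; the function name `date_granularity` is hypothetical and not part of any DataCite tooling.

```python
import datetime
import re

# Assumption: only YYYY, YYYY-MM, and YYYY-MM-DD are handled here.
_DATE = re.compile(r"([0-9]{4})(?:-([0-9]{2})(?:-([0-9]{2}))?)?")


def date_granularity(value: str) -> str:
    """Return 'year', 'month', or 'day' for a date string, or raise ValueError."""
    match = _DATE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a YYYY, YYYY-MM, or YYYY-MM-DD date: {value!r}")
    year, month, day = match.groups()
    # datetime.date rejects impossible values such as month 13 or 30 February.
    datetime.date(int(year), int(month or 1), int(day or 1))
    if day is not None:
        return "day"
    return "month" if month is not None else "year"


for example in ("2011", "2011-03", "2011-03-15"):
    print(example, date_granularity(example))
```

A check like this lets a metadata creator see how much temporal resolution a collection actually uses, rather than only whether a date is present.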
Version 2.2 – July 2011
Small but interesting evolutionary tweaks continued with Version 2.2. First, URL was added as an identifier type, recognizing that URLs had become important identifiers in many contexts regardless of the details of the difference between URL and URI. Second, the importance of series in citations was recognized by adding SeriesInformation as a type of description, i.e. as free text. Finally, the importance of models as a resource type, essentially a type of dataset, was recognized by adding Model to the resourceTypeGeneral vocabulary.
A bigger change to the vocabulary for contributor type, adding eight elements (70%, see Table 2), reflected the need to include new roles in the research web. These additions also reflected the difficulties inherent in shared vocabularies: are sponsor and funder really different roles?
Metadata Property | Accepted Values (Vocabulary)
contributorType | Producer, Distributor, RelatedPerson, Supervisor, Sponsor, Funder, RightsHolder
resourceTypeGeneral |
It is interesting to note that many thousands of metadata records based on Version 2.2 of the schema still exist in the DataCite repository. Unfortunately, none of these records take advantage of the schema improvements made during the last decade.
Version 3 – July 2013
The first major schema upgrade, to Version 3, occurred during July 2013 with significant increases in capabilities. First, the addition of geolocation information along with the Collected dateType extended the DataCite schema focus on discovery by supporting spatial/temporal portals and map or timeline related discovery tools. At the same time geoLocationPlace was added, supporting spatial place names (i.e., spatial keywords). It should be noted that the geolocation information pertains to the “spatial region or named place where the data was gathered or about which the data is focused” rather than the location of the institution that published the data or other resource.
Along with this expansion of discovery capabilities, DataCite acknowledged the need for the ability to link external, local, and/or domain-specific metadata to the DataCite record by adding the HasMetadata and IsMetadataFor relation types with associated schema information. These additions evolved the DataCite metadata into a discovery hub with connections to more detailed metadata that supports more advanced use cases, i.e., access, interoperability and re-use. The addition of Methods as a descriptionType also provided new support for these use cases, particularly re-use.
This version added the property rightsURI to Rights and increased the cardinality of Rights from one to many, acknowledging the need for unambiguously identifying licensing information by adding identifiers to the metadata, and for multiple licenses in some cases.
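As a rough illustration of these two Version 3 additions, the sketch below builds a small XML fragment with a HasMetadata link and a rights statement that carries a rightsURI. It is a simplified sketch: the DataCite namespace and all mandatory properties are omitted, the function name `build_v3_links` is hypothetical, and the URLs and the scheme label are placeholders.

```python
import xml.etree.ElementTree as ET


def build_v3_links(metadata_url: str, license_name: str, license_url: str) -> ET.Element:
    """Build a namespace-free fragment with a HasMetadata link and a rights URI."""
    resource = ET.Element("resource")

    # Link from the DataCite record to more detailed external metadata.
    related = ET.SubElement(resource, "relatedIdentifiers")
    link = ET.SubElement(
        related,
        "relatedIdentifier",
        {
            "relatedIdentifierType": "URL",
            "relationType": "HasMetadata",
            # Placeholder label for the scheme of the external metadata.
            "relatedMetadataScheme": "ExampleScheme",
        },
    )
    link.text = metadata_url

    # Rights as free text plus an identifier for the license.
    rights_list = ET.SubElement(resource, "rightsList")
    rights = ET.SubElement(rights_list, "rights", {"rightsURI": license_url})
    rights.text = license_name
    return resource


fragment = build_v3_links(
    "https://example.org/metadata/record-1.xml",
    "Creative Commons Attribution 4.0 International",
    "https://creativecommons.org/licenses/by/4.0/",
)
print(ET.tostring(fragment, encoding="unicode"))
```

The free-text license name stays readable for people, while the URI gives machines an unambiguous target to match on.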
Two interesting changes occurred outside of the actual schema in this version. First, the documentation identified strongly recommended properties that can be used to achieve greater exposure for the resource’s metadata, stating (in bold text) “Those clients who wish to enhance the prospects that their metadata will be found, cited and linked to original research are strongly encouraged to submit the Recommended as well as Mandatory set of properties.” This change in the DataCite recommendations, from mandatory to mandatory plus recommended, and the connection of the change to client wishes, was an important message to DataCite metadata providers and was an early indication of the observation discussed by Habermann, 2020: if you specify minimum metadata, i.e. mandatory fields, that is all you get. Fenner, 2019 is the most recent reiteration of a similar message by DataCite.
Version 3 of the schema documentation introduced the concept of Other DataCite Services with Metadata Store, Metadata Search and Content Negotiation for the first time. This is not a change in the schema, but it does reflect a new emphasis in the environment towards sharing metadata and related services. This change is also reflected in the connection of the phrase “Please note that DataCite reserves the right to share metadata with information indexes and other entities” in the documentation with the DataCite Business Principles.
Providing services on top of the metadata is a critical part of the DataCite business plan. Of course, in order for those services to be effective, the metadata must include the properties that drive the services, i.e. complete metadata is critical to the survival of DataCite. Note that the Crossref Participation Reports reflect the same connection.
Metadata Property | Accepted Values (Vocabulary)
contributorType | ResearchGroup, Other
resourceTypeGeneral |
relationType |
Version 3.1 – October 2014
Two small, but significant, changes were introduced in Version 3.1 of the schema: the affiliation attribute (as text) was added to creators and contributors, and relationTypes were added for reviews and resource chains. The first of these introduced research organizations into the schema for the first time, adding a connection to research organizations and an important discovery path. The second introduced the ability to connect reviews into the research network driven by the desire to increase transparency and openness in the scientific review process as well as to link resources to one another to create chains of provenance and processing.
Metadata Property | Accepted Values (Vocabulary)
contributorType | DataCurator
relationType |
Version 4.0 – September 2016
Two significant changes were introduced in Version 4.0 of the schema: resourceTypeGeneral became mandatory and funder evolved from a contributorType to FundingReference, a structure including information about and identifiers for funders and awards. The promotion of resourceTypeGeneral from one of the two “most important recommended fields” (V 3.1) to mandatory reflected the increasing diversity of the resource types described by DataCite metadata and the related increase in importance of resource type as a discovery facet.
A small, but important, change in Version 4.0 was allowing multiple identifiers for people (creators and contributors), acknowledging that many people had legacy, typically local, identifiers and needed to add more universal identifiers (i.e., ORCIDs) to their metadata. Note that this is similar to the increase in cardinality for rights that was included in Version 3.0.
Finally, the valueURI sub-property was added to subjects in order to accommodate linked-data approaches and the increased utilization of ontology items with unique identifiers as subject keywords. Again, like the addition of rightsURI, this reflects the broadening role of identifiers as important connectors in metadata records.
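The three Version 4.0 changes described above can be sketched together in one small fragment. This is an illustrative sketch only: the namespace and mandatory properties are omitted, the helper names `child` and `build_v4_fragment` are hypothetical, and every name, identifier, and URL is a placeholder rather than a real person, funder, or vocabulary term.

```python
import xml.etree.ElementTree as ET


def child(parent: ET.Element, tag: str, text: str = "", **attrib: str) -> ET.Element:
    """Append a child element with optional text and attributes."""
    element = ET.SubElement(parent, tag, attrib)
    if text:
        element.text = text
    return element


def build_v4_fragment() -> ET.Element:
    """Build a namespace-free fragment; every value below is a placeholder."""
    resource = ET.Element("resource")

    # A creator carrying both a legacy local identifier and an ORCID iD.
    creator = child(child(resource, "creators"), "creator")
    child(creator, "creatorName", "Doe, Jane")
    child(creator, "nameIdentifier", "staff-1234", nameIdentifierScheme="LocalStaffID")
    child(creator, "nameIdentifier", "0000-0000-0000-0000", nameIdentifierScheme="ORCID")

    # A subject keyword tied to a vocabulary term through valueURI.
    child(child(resource, "subjects"), "subject", "sea surface temperature",
          valueURI="https://example.org/vocab/sea-surface-temperature")

    # Funding information as a structure rather than a contributorType.
    funding = child(child(resource, "fundingReferences"), "fundingReference")
    child(funding, "funderName", "Example Science Foundation")
    child(funding, "funderIdentifier", "https://example.org/funders/42",
          funderIdentifierType="Other")
    child(funding, "awardNumber", "ABC-123")
    child(funding, "awardTitle", "Example Award Title")
    return resource


fragment = build_v4_fragment()
print(ET.tostring(fragment, encoding="unicode"))
```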
Version 4.1 – October 2017
The theme of Version 4.1 of the schema was software citation, adopted “in response to increasing interest within the community”. The documentation noted that “very few actual schema changes were required, but substantial modifications needed to be made to the documentation to assist those registering DOIs for software”. In other words, this was a reinterpretation of the existing properties and capabilities rather than an addition of new ones.
Metadata Property | Accepted Values (Vocabulary)
resourceTypeGeneral |
relationType |
The V4.0 promotion of resourceTypeGeneral continued in Version 4.1 with the addition of resourceTypeGeneral as an attribute of relatedIdentifier. The growing diversity of resources in the research web now required letting users know what kind of thing they were pointing to rather than just the identifier of that thing, a significant increase in the richness of the PID Graph.
Version 3.1 introduced research organizations into the schema as affiliations of humans. Version 4.1 made organizations creators and contributors in their own right, even though organizations had been included as contributorTypes since the beginning (HostingInstitution, RegistrationAgency, RegistrationAuthority, Distributor, Sponsor, Funder, …). More importantly, this allowed searches for organizations in the metadata to be made more precisely.
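A short sketch shows how these two Version 4.1 capabilities might be used when reading a record. It assumes a simplified, namespace-free record in which organizational creators are marked with a nameType attribute on creatorName; the record content, the DOI, and the function names are placeholders invented for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder record: namespace and mandatory properties are omitted.
RECORD = """
<resource>
  <creators>
    <creator>
      <creatorName nameType="Organizational">Example Research Institute</creatorName>
    </creator>
    <creator>
      <creatorName nameType="Personal">Doe, Jane</creatorName>
    </creator>
  </creators>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsSupplementTo"
                       resourceTypeGeneral="Text">10.1234/example-article</relatedIdentifier>
  </relatedIdentifiers>
</resource>
"""


def organizational_creators(root: ET.Element) -> list:
    """Return the names of creators explicitly marked as organizations."""
    return [
        element.text
        for element in root.iter("creatorName")
        if element.get("nameType") == "Organizational"
    ]


def related_resource_types(root: ET.Element) -> dict:
    """Map each related identifier to the kind of resource it points to."""
    return {
        element.text: element.get("resourceTypeGeneral")
        for element in root.iter("relatedIdentifier")
    }


root = ET.fromstring(RECORD.strip())
print(organizational_creators(root))
print(related_resource_types(root))
```

Without the type information, a search for organizations would have to guess from the name string alone, and a link would reveal nothing about its target until it was resolved.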
Version 4.2 – March 2019
Version 4.2 of the schema continued the expansion of the rights property that started in V3.0 (rightsURI) with the addition of other types of identifiers (SPDX) and schemas for unambiguously identifying licenses associated with resources.
The description of DataCite Services changed in V4.2, reflecting the broadening of services offered by DataCite on top of their metadata. The new services described included Fabrica for creating metadata and Event Data for finding connections. It is interesting to note that Event Data was proposed as a tool for metadata creators, i.e. DataCite clients, to add connections to their metadata, rather than for DataCite to add them, reflecting the DataCite business philosophy that “Only the organization publishing the DOI can update the metadata, and it is important to keep it this way to have a single authoritative source.”
Metadata Property | Accepted Values (Vocabulary)
relationType |
Version 4.3 – August 2019
Version 4.3 of the schema once again extended the presence of research organizations and the reach of identifiers in DataCite metadata, bringing structure to creator and contributor affiliations by adding identifiers and associated schema information.
Trends
Figure 1 summarizes the evolution of DataCite metadata described here and highlights changes related to increasing “FAIRness”. Three types of changes are shown:
• bold text shows changes related to mandatory fields,
• italic text shows additions to several shared vocabularies,
• plain text shows properties introduced in various schema versions.
Overall, these changes reflect a marked increase in the capability to describe data in a FAIR way and several other long-term trends.
First, there is a clear trend towards increasing the diversity of resources that can be described with DataCite metadata (resourceTypeGeneral) and the diversity of kinds of things that can be connected using that metadata (relatedIdentifiers and relationType). These are both critical ingredients in building out the PID Graph and the recent announcement of the DataCite Commons showcases the powerful results of this trend.
There is also a clear trend of increasing structure in the metadata. Many documentation elements start as free text and evolve by adding more structure over time. Examples include rights, which started as free text in V2.0 and is now a structured property including identifiers and related metadata, as well as funders and affiliations which followed the same path.
Future
Of course, these improvements in the DataCite Metadata Schema are only productive when they are adopted by metadata creators and expected by data users. In addition to working with the DataCite Metadata Working Group, I have also been interested in evaluating and understanding the “FAIRness” of DataCite metadata collections. That work (Habermann, 2019) indicates that many of the DataCite metadata elements that support the FAIR use cases remain uncommon in DataCite metadata, particularly those that support the accessible, interoperable, and reusable (AIR) use cases. This observation is also reflected in recent blogs about making the most of available metadata (Fenner, 2019) and subject keywords in DataCite metadata (Habermann, 2020). The schema changes are clearly providing a roadmap for community evolution rather than reflecting changes already implemented over the last decade.
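One simple way to see how common a property is in a collection is to count the share of records that contain it. The sketch below is not the method used in the work cited above; it assumes records are available as Python dictionaries keyed by property name, and both the `TRACKED` list and the sample records are hypothetical choices made for illustration.

```python
from collections import Counter

# Assumption: a hand-picked set of properties to track, not an official list.
TRACKED = (
    "subjects",
    "contributors",
    "dates",
    "relatedIdentifiers",
    "descriptions",
    "geoLocations",
    "fundingReferences",
    "rightsList",
)


def property_coverage(records: list) -> dict:
    """Return, for each tracked property, the fraction of records that fill it."""
    if not records:
        return {}
    counts = Counter()
    for record in records:
        for name in TRACKED:
            # Missing keys and empty lists both count as "not provided".
            if record.get(name):
                counts[name] += 1
    return {name: counts[name] / len(records) for name in TRACKED}


sample = [
    {"subjects": [{"subject": "oceanography"}], "descriptions": [{"description": "..."}]},
    {"subjects": [], "rightsList": [{"rights": "CC BY 4.0"}]},
]
for name, share in property_coverage(sample).items():
    print(f"{name}: {share:.0%}")
```

Run against a real collection, a table like this gives a baseline that can be compared before and after a metadata improvement effort.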
Metadata and organizational change are two difficult problems we think about at Metadata Game Changers and improving metadata brings them together. Measuring metadata creates an important baseline for identifying good examples and fruitful improvement opportunities, essential initial steps in the metadata improvement process. If you are interested in improving your DataCite metadata, please contact us at Metadata Game Changers.