Minimum Metadata
Cite this blog as Habermann, T. (2020). Minimum Metadata. Front Matter. https://doi.org/10.59350/kanrj-qt678
Looking for New Year’s metadata resolutions? How about: Stop using sentences that include the words “minimum metadata” without specifying a use case.
Sentences that include the words “minimum metadata” come up frequently in metadata discussions, usually in the context of what a data provider wants to provide or, even more common, in the context of what should be expected of them. In the end, these sentences, and the decisions made because of them, drive important limitations on the metadata that exists to help users understand, use, and trust data. Experience, and data, clearly indicate that if you specify minimum metadata, it is all that you get.
Consider, for example, the DataCite metadata schema. In this case, the minimum metadata, specified as Mandatory, includes six dialect-independent documentation concepts: Resource Identifier, Resource Author, Resource Title, Resource Publisher, Resource Publication Date, and Resource Type General (see Habermann, 2019). Figure 1 (from Habermann, 2019) shows the percentage of records in a large sample of DataCite metadata that include concepts that support Findability. All of the mandatory DataCite elements are in this category, and they are easy to identify in the figure: they are the ones that occur in over 90% of the records (shown in red). Resource Type General joined the mandatory group in the last major schema upgrade during 2015, so it occurs a bit less frequently than the others, which have always been mandatory.
The DataCite metadata schema also includes several properties that are recommended “for the purpose of achieving greater exposure for the resource’s metadata record, and therefore, the underlying research itself.” These properties are shown as green in Figure 1. It is clear that most data providers focus on mandatory properties rather than recommended ones. In fact, the concept Abstract is the only recommended property that occurs in more than 50% of the metadata records.
The observation that you only get minimum metadata is certainly not unique to DataCite. In fact, it holds for every metadata repository I have experience with, across many domains.
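The kind of measurement behind Figure 1 can be sketched in a few lines of Python. This is only an illustration, not the method used in Habermann (2019): the records are hand-made sets of concept names, and the function name `concept_completeness` is an assumption introduced here.

```python
from collections import Counter

# Hypothetical, hand-made records (not real DataCite data): each record
# is represented by the set of documentation concepts it contains.
records = [
    {"Resource Identifier", "Resource Title", "Resource Author", "Abstract"},
    {"Resource Identifier", "Resource Title", "Resource Author"},
    {"Resource Identifier", "Resource Title", "Keyword"},
    {"Resource Identifier", "Resource Title", "Resource Author",
     "Abstract", "Keyword"},
]


def concept_completeness(records):
    """Return {concept: percentage of records that include that concept}."""
    if not records:
        return {}
    # Count, for every concept, how many records contain it.
    counts = Counter(concept for record in records for concept in record)
    total = len(records)
    return {concept: 100.0 * n / total for concept, n in counts.items()}


completeness = concept_completeness(records)

# Print the concepts from most to least common.
for concept, pct in sorted(completeness.items(), key=lambda kv: -kv[1]):
    print(f"{concept:22s} {pct:5.1f}%")
```

In a real collection, the mandatory concepts would cluster near 100% and the recommended and optional ones would trail off, which is the pattern described above.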
Conservation of Burden
In many cases the minimum metadata sentence comes up in the context of minimizing the documentation burden on data providers. “If we don’t make it easy for them, they won’t do anything” is a common variant.
No one disputes that creating well-documented, trustworthy datasets that are easy to reuse can be a difficult task. It is important to keep in mind that the burden associated with reuse is conserved, so work that is not done by the data provider inevitably falls on every single potential data re-user. There is no free lunch: “minimum metadata” on the provider side means maximum work on the re-use side.
In a world of minimum metadata, it is not surprising that data re-use is difficult and, as a consequence, rare. After all, if the burden is not on the data provider, it is multiplied over all of the users.
Minimum = Discovery
Many of the minimum metadata discussions resolve to something like “the first step is data discovery”, so minimum metadata is almost always focused on data discovery. The FAIR (meta)data principles have broadened our perspective to include three other use cases: accessibility, interoperability, and reusability. Can we extend our definition of “minimum metadata” to include these use cases?
The MetaDIG project developed FAIR metadata recommendations and explored their application to the DataCite metadata schema (Habermann, 2019). The recommendations were divided into two groups: Essential and Supporting. In the Findability case, the essential metadata elements are typically text fields that can support human searches: titles, abstracts, keywords, author names, affiliations, etc. Table 1 shows the Essential and Supporting concepts used by Habermann (2019) to assess Findability in DataCite metadata.
Findable Essential | DCO** | Supporting |
Abstract* | R | |
Date Created* | R | Date Submitted |
Keyword* | R | Keyword URI |
Keyword Vocabulary* | R | Keyword Vocabulary/Ontology URI |
Resource Author* | M | Resource Author Type (Person / Organization) |
| O | Resource Author Identifier*; Resource Author Identifier Type*; Resource Author Identifier Schema URI |
Resource Author Affiliation* | O | Resource Author Affiliation Identifier; Resource Author Affiliation Identifier Type*; Resource Author Affiliation Identifier Schema URI |
Resource Identifier* | M | Resource Identifier Type* |
Resource Publication Date* | M | |
Resource Publisher* | M | |
Resource Title* | M | |
Resource Type General* | M | |
Project Sponsor Funder* | O | Project Sponsor Identifier; Project Sponsor Identifier Type; Project Sponsor Identifier Scheme; Project Sponsor Identifier Scheme URI |
Sponsor Project Identifier* | O | |
Temporal Extent* | O | |
Spatial Extent* | R | Spatial Extent Bounding Box*; Spatial Extent Polygon*; Spatial Extent Point*; Spatial Extent Bounding Place* |
The DataCite metadata dialect is focused primarily on discovery, i.e. the F in FAIR. Nevertheless, there are a number of DataCite elements that support data access and making connections between datasets, software, institutions, documentation, and people. These connections are critical in supporting interoperability and reusability.
Table 2 shows proposed Essential and Supporting concepts for the assessment of Accessibility metadata. These are primarily distribution and rights information, helping users get the data and understand what they can do with it.
Accessibility Essential | DCO** | Supporting |
Distribution Contact (role = Distributor)* Name, Family Name, Given Name (R)* | O | Distribution Contact Identifier; Distribution Contact Identifier Type; Distribution Contact Identifier Scheme; Distribution Contact Identifier Scheme URI |
Rights Holder (role = RightsHolder) Name, Family Name, Given Name (R) | O | Rights Holder Identifier; Rights Holder Identifier Type; Rights Holder Identifier Scheme*; Rights Holder Identifier Scheme URI |
Rights*, RightsURI* | O | Rights Identifier; Rights Identifier Scheme; Rights Identifier Scheme URI |
Resource Size | O | |
Resource URL* | M | |
Table 3 shows the essential and supporting concepts for Interoperability. This use case is primarily aimed at integrating datasets into tools for analysis and comparison with other datasets. It is about the format of the data.
Interoperability Essential | DCO** | Supporting |
Resource Format* | O | |
Table 4 shows the essential and supporting concepts for Reusability. Most of the elements proposed for the A, I, and R use cases (Accessibility, Interoperability, and Reusability) fall into the connector category, either connecting with a contact to ask questions or connecting with other resources.
Reusable Essential | DCO** | Supporting |
Resource Contact (role = Contact Person)* Name, Family Name, Given Name (R) | O | Resource Contact Given Name; Resource Contact Family Name; Resource Contact Identifier; Resource Contact Identifier Type; Resource Contact Identifier Scheme; Resource Contact Identifier Scheme URI |
Methods* | R | |
Cited By (relationType = IsCitedBy) | R | |
Described By (relationType = IsDescribedBy) | R | |
Has Metadata (relationType = HasMetadata) | R | |
Referenced By (relationType = IsReferencedBy) | R | |
Reviewed By (relationType = IsReviewedBy) | R | |
Source Of (relationType = IsSourceOf) | R | |
Supplement To (relationType = IsSupplementTo) | R | |
Two Ways Out
The first way out is connecting resources. This is one of the important roles of DataCite metadata, and these connections provide critical escapes from minimal metadata. The “HasMetadata” relationType, along with the relatedMetadataScheme attribute, makes it possible to connect a DataCite record to more detailed metadata and to let the user know the dialect of those metadata. If, for example, an organization has complete metadata compliant with ISO 19115-1, that metadata can include data quality, user feedback, instrumentation, and lineage metadata not included in DataCite metadata. Letting the user know that those metadata exist can help them understand these important aspects of the data. In addition, the “IsDocumentedBy” relationType makes it possible to connect to documentation of the resource, typically documents rather than structured metadata. This could point to published reports or papers that contain important details about how the data were processed or why the data were collected. This information is helpful when trying to understand the data and decide whether it is trustworthy.
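As a rough illustration of these two connections, the sketch below builds a simplified `relatedIdentifiers` fragment with Python's standard library. The attribute names follow the DataCite XML schema (where the documentation relation is spelled `IsDocumentedBy`), but the fragment omits the DataCite namespace and the rest of the record. The `example.org` URLs, the scheme label, and the helper name `related_identifier` are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def related_identifier(value, id_type, relation_type, **extra):
    """Build one simplified <relatedIdentifier> element (no namespace)."""
    attrs = {"relatedIdentifierType": id_type, "relationType": relation_type}
    attrs.update(extra)  # optional attributes such as relatedMetadataScheme
    element = ET.Element("relatedIdentifier", attrs)
    element.text = value
    return element


related = ET.Element("relatedIdentifiers")

# Connection 1: point to richer structured metadata and name its dialect.
related.append(related_identifier(
    "https://example.org/metadata/dataset-1.xml",  # hypothetical URL
    "URL",
    "HasMetadata",
    relatedMetadataScheme="ISO 19115-1",  # illustrative dialect label
    schemeURI="https://example.org/schemas/iso19115-1.xsd",  # hypothetical
))

# Connection 2: point to unstructured documentation (a report or paper).
related.append(related_identifier(
    "https://example.org/reports/processing-report.pdf",  # hypothetical URL
    "URL",
    "IsDocumentedBy",
))

xml_text = ET.tostring(related, encoding="unicode")
print(xml_text)
```

A record that carries even these two links tells the re-user where the detailed metadata and the documentation are, without having to reproduce either in DataCite itself.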
The second way out is using more precise language in our metadata discussions, i.e. adding “for (use case)” to our sentences. Tables 2-4 can be very helpful here. Sentences like “minimum metadata for interoperability includes the resource format” or “minimum metadata for accessibility includes a link along with information about distribution, rights, and data size” are both very reasonable. When a data provider asks, “what is the minimum metadata I need?” the best answer is “what use cases are you trying to support?” This answer helps providers understand the breadth of use cases supported by complete metadata and, hopefully, leads to creating metadata that supports accessibility, interoperability, re-use, and trust in addition to data discovery.
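The “for (use case)” framing can also be made concrete in code. The following sketch checks a record against use-case-specific minimums. The `ESSENTIAL` mapping is an abbreviated, illustrative subset of the Essential columns in Tables 1-4, not the full recommendation, and the function name `missing_concepts` is an assumption made for this example.

```python
# Abbreviated, illustrative subset of the Essential concepts in Tables 1-4.
ESSENTIAL = {
    "findability": {
        "Resource Identifier", "Resource Title", "Resource Author",
        "Resource Publisher", "Resource Publication Date",
        "Resource Type General", "Abstract", "Keyword",
    },
    "accessibility": {"Resource URL", "Rights", "Distribution Contact",
                      "Resource Size"},
    "interoperability": {"Resource Format"},
    "reusability": {"Resource Contact", "Methods"},
}


def missing_concepts(record_concepts, use_cases):
    """Return {use_case: sorted list of essential concepts the record lacks}."""
    present = set(record_concepts)
    return {uc: sorted(ESSENTIAL[uc] - present) for uc in use_cases}


# A hypothetical record that contains only the six mandatory concepts.
record = {
    "Resource Identifier", "Resource Title", "Resource Author",
    "Resource Publisher", "Resource Publication Date",
    "Resource Type General",
}

report = missing_concepts(record, ["findability", "interoperability"])
for use_case, missing in report.items():
    print(f"minimum metadata for {use_case} is missing: {missing}")
```

The answer to “is this record complete?” changes with the use cases passed in, which is the point of asking providers what they are trying to support.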