DataCite Facets and Metadata Completeness
Cite this blog as Habermann, T. (2025). DataCite Facets and Metadata Completeness, Front Matter. https://doi.org/10.59350/55etd-e8154.
Background
A facet is an item, often from a controlled list, that provides a count of records in a query result with particular values for the related metadata element and can be used to filter search results using that element. For example, a “Published” facet would provide a count of records published per year and could be used to select all the items in a query result published during a particular year. Likewise, the “Resource Type” facet can show the number of records for different resource types and then be used to select all the items in a query with a particular resource type.
Several years ago, I described how facets could help us understand the big picture of DataCite metadata usage, answering questions like “What are the top ten DataCite repositories for each DataCite resource type?” (Habermann, 2022). DataCite includes facet results for every API query (see Table A1 for descriptions), so they can also be used to provide overviews of a single repository. Some facet results are also included in the DataCite Commons. For example, the DataCite Commons page for Metadata Game Changers includes facets on the left side of the page that provide data about our small collection of metadata records and can be used to filter our records using Creators & Contributors, Publication Year, Work Type, and several other facets. These facet results can also provide insights into repository characteristics and into potential opportunities for improving repository content.
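As a rough sketch of how these facet results might be retrieved for a single repository, the Python below builds a query URL and separates the facet lists from the rest of the response. The endpoint, the client-id and page[size] parameters, and the layout of the meta block (a total plus one list of id/title/count items per facet) are assumptions about the DataCite REST API that should be checked against its documentation; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public DataCite REST API.
DATACITE_DOIS = "https://api.datacite.org/dois"


def facet_query_url(client_id: str) -> str:
    """Build a query URL for one repository (client)."""
    # Assumption: "client-id" filters by repository and "page[size]=0"
    # suppresses the record list so mainly the summary block is returned.
    params = {"client-id": client_id, "page[size]": 0}
    return f"{DATACITE_DOIS}?{urlencode(params)}"


def extract_facets(response: dict) -> tuple[int, dict]:
    """Split a parsed API response into (number of records, facet lists)."""
    meta = response.get("meta", {})
    total = meta.get("total", 0)
    # Assumption: facets are the list-valued entries of the "meta" block,
    # each item looking like {"id": ..., "title": ..., "count": ...}.
    facets = {name: value for name, value in meta.items() if isinstance(value, list)}
    return total, facets


def fetch_facets(client_id: str) -> tuple[int, dict]:
    """Fetch and parse the facets over the network (only runs when called)."""
    with urlopen(facet_query_url(client_id), timeout=30) as reply:
        return extract_facets(json.load(reply))
```

Calling fetch_facets("dryad.dryad") would then return the record count and a dictionary of facet lists for Dryad, provided the assumptions above hold.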
Facets and Completeness
Many useful repository measures focus on completeness of the metadata, i.e., the portion of records in the repository that include some metadata element. The DataCite facet data can provide some insight into completeness, but we must keep in mind that the facet data are limited to the top ten values for most facets (published and resourceTypes are the exceptions and can return more than ten values).
The relationship between facet number (the number of facets), facet totals (the total of the facet value counts), and completeness is illustrated schematically in Figure 1 and described here using terminology for facet statistics defined in Table A2:
If the metadata element for a particular facet does not exist in the repository, the facet is not included in the DataCite results, i.e., facet number = 0.
If the facet number = 1 and the facet total = the number of records, the metadata element is complete, i.e., it exists in every record, and it always has the same value.
If the facet number is less than ten and the facet total is less than the total number of records, the top ten facet values cover all records with values and the facet total provides a measure of completeness for the metadata element (completeness = facet coverage = facet total / number of records).
If the facet number = 10, even if the facet total is less than the number of records, completeness cannot be determined as there are unknown values for the facet metadata element and they are not included in the total.
The facets resourceType and published are exceptions to these patterns because they can have facet numbers greater than 10. For these two facets, the facet total is the total number of records with the related metadata element, and completeness = facet coverage. Both of these are mandatory metadata elements in Version 4 of the schema, so completeness is typically 100%.
Figure 1. The relationship between the number of facets (facet number), the number of records with facets (facet total) and completeness varies with facet number and the facet total.
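The rules above can be sketched as a small Python function. This is a minimal illustration rather than DataCite code: the function name, its arguments, and the returned dictionary are hypothetical, and it assumes that each facet is supplied as a plain list of value counts. A completeness of None means "cannot be determined".

```python
def facet_statistics(counts, n_records, unlimited=False, limit=10):
    """Return facet number, facet total, coverage and completeness.

    counts     -- counts of the facet values returned for one facet
    n_records  -- number of records in the query result
    unlimited  -- True for facets that are not capped at the top values
                  (resourceType and published in the text above)
    limit      -- maximum number of values returned for capped facets
    """
    number = len(counts)   # facet number
    total = sum(counts)    # facet total
    coverage = total / n_records if n_records else None

    if number == 0:
        # Element absent from the repository: the facet is not returned.
        completeness = 0.0
    elif unlimited or number < limit:
        # Every value is listed, so coverage measures completeness.
        completeness = coverage
    else:
        # Cut-off reached: records with unlisted values may exist.
        completeness = None
    return {"number": number, "total": total,
            "coverage": coverage, "completeness": completeness}


# One value that occurs in every record: the element is complete.
print(facet_statistics([162101], 162101))
```

Facets that count events rather than records can have coverage above 100%, so their coverage should not be read as completeness even when fewer than ten values are returned.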
Dryad Facets
The Dryad Data Repository has a well-established history as a curated collection of datasets connected to research papers and as an “activist” repository committed to continuous improvement. Dryad has been using DataCite for over a decade. The Dryad facets shown in Table A3 provide examples of the features described above and shown in Figure 1. Those examples are described in Table 1.
Type
Examples
facet number = 1 and facet total = number of records.
In these cases, the repository content is complete.
Three Dryad facets are in this category:
- states gives the number of publicly available records in the repository. The records in any DataCite query result are found by the query, so the states facet is always “Findable”.
- provider gives the provider identifiers of records in the repository. Most repositories have only one provider and, in this case, the single provider = DRYAD.
- client gives the identifier of the repository, termed client in DataCite. Repositories have only one client-id, and in this case, client = dryad.dryad.
All three of these facets have only one value and a total of 162,101. They also have coverage = 100%. The metadata for these facets are generated by the DataCite system, i.e., they are not included in the DataCite metadata schema.
facet number < 10 and facet total < number of records.
In these cases, the repository completeness is equal to the facet coverage.
Four Dryad facets are in this category:
- schemaVersions gives the major schema versions, i.e. 2, 3, or 4, found in the repository. The current version of the schema is 4.6 and Schema 4 is the most common value of this facet. Repositories that have been using DataCite since before September 2016, when Version 4.0 of the schema was released, have some records in earlier versions, even Version 2!
- LinkCheckStatus gives counts of HTTP response codes generated by the DataCite Link Checker Service that periodically checks a random sampling of DOIs to verify that they still resolve to a valid URL. Values not equal to 200 indicate an error. DOIs with specific error values can be found using the DataCite API.
- license gives the licenses that are included in the repository metadata. In this case, 39% of the records in Dryad have one of three licenses, with CC0-1.0 being the most common, occurring in 38% of the records.
- resourceType gives the resourceTypeGeneral values observed in the repository. Dryad is dominated by Datasets, 161,912 records, and includes < 20 records with several other types.
Facet number = 10.
In these cases, there is no information about completeness, but the facet values may provide interesting repository insights.
Ten Dryad facets are in this category: views, downloads, prefixes, registered, created, subjects, citations, fieldsOfScience, affiliations, and published.
Note that two of these facets, views and downloads, have totals greater than the number of records (coverage > 100%), which reflects users accessing datasets in the repository. Also note that it is unusual for a single repository to support multiple prefixes, ten or more in this case. This reflects repository subsets created for some reason, perhaps for separate repository members.
Table 1. Facet examples from Dryad facets shown in Table A3.
Facet Reports
As suggested above, these facet data can provide insights into repository characteristics and behaviors that may be interesting or useful for repository management and improvement. In the Dryad case, there are three such insights.
First, the views and downloads facets, with over one million and over ten million occurrences respectively, demonstrate that Dryad is achieving its goals of providing access to and encouraging re-use of the datasets in the repository. This is also consistent with the observation that Dryad is among the most commonly used repositories for datasets funded by NSF, USGS, and USAID (Habermann et al., 2023).
Second, the schemaVersions facet indicates that Dryad includes over 50,000 records that use versions 2 or 3 of the DataCite metadata schema. These older versions have limited support for many FAIR DataCite metadata elements and, as of January 2025, they are no longer accepted for the creation of new records. For example, records in these schema versions must be updated before they can include identifiers for research organizations, licenses, and funders, or several kinds of relations and resourceTypes that are available only in the more recent versions.
Third, there are nearly 100,000 records in the Dryad repository that do not include license information in their metadata. This is unexpected because Dryad does not accept any files with licensing terms that are incompatible with the Creative Commons Zero waiver (CC0), which indicates that the data are in the public domain, and Dryad recommends that license as a good data practice.
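The license figure can be checked with a short calculation using the Dryad numbers quoted earlier: 162,101 records in total and a license facet coverage of 39%. Because the 39% is a rounded value, the result is only an estimate, and the variable names are illustrative.

```python
n_records = 162_101       # total reported by the states, provider and client facets
license_coverage = 0.39   # rounded share of records with a license facet value

# Records with and without license information, estimated from coverage.
with_license = round(n_records * license_coverage)
without_license = n_records - with_license

print(f"{with_license:,} with a license, {without_license:,} without")
```

Any coverage that rounds to 39% leaves between roughly 98,000 and 99,700 records without a license, which is consistent with "nearly 100,000".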
If you are a repository manager and would like to explore a complimentary facet report for your DataCite repository, please contact me at ted@metadatagamechangers.com. More detailed reports focused on Identifier Connectivity and FAIR DataCite Metadata are also available.
References
Habermann, T. (2022). DataCite Facets: Understanding DataCite Usage. Front Matter. https://doi.org/10.59350/wwv9g-31p18
Habermann, T., Jones, J., Ratner, H., & Packer, T. (2023). INFORMATE: Where Are the Data? Front Matter. https://doi.org/10.54900/vnevh-vaw22