DataCite Facets and Metadata Completeness

Cite this blog as Habermann, T. (2025). DataCite Facets and Metadata Completeness, Front Matter. https://doi.org/10.59350/55etd-e8154.

Background

A facet is an item, often from a controlled list, that provides a count of records in a query result with particular values for the related metadata element and can be used to filter search results using that element. For example, a “Published” facet would provide a count of records published per year and could be used to select all the items in a query result published during a particular year. Likewise, the “Resource Type” facet can show the number of records for different resource types and then be used to select all the items in a query with a particular resource type.

Several years ago, I described how facets could help us understand the big picture of DataCite metadata usage, answering questions like “What are the top ten DataCite repositories for each DataCite resource type?” (Habermann, 2022). DataCite includes facet results for every API query (see Table A1 for descriptions), so they can also be used to provide overviews of a single repository. Some facet results are also included in the DataCite Commons. For example, the DataCite Commons page for Metadata Game Changers includes facets on the left side of the page that provide data about our small collection of metadata records and can be used to filter our records using Creators & Contributors, Publication Year, Work Type, and several other facets. These facet results can also provide insights into repository characteristics and into potential opportunities for improving repository content.
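
As an illustration of how facet results might be pulled out of an API response, here is a minimal Python sketch. It assumes that a repository query can be written with the client-id parameter, that page[size]=0 suppresses the record list, and that the response carries the record count in meta.total with each facet as a list of objects holding id, title, and count. The function names and the sample response are hypothetical, and the sample is hand-written from the Dryad license counts in Table A3 rather than fetched from the API.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.datacite.org/dois"  # public DataCite REST API endpoint


def facet_query_url(repository_id: str) -> str:
    """Build a repository query that asks for facets but no records."""
    # Assumption: client-id selects one repository and page[size]=0
    # returns only the meta block (record total plus facets).
    params = {"client-id": repository_id, "page[size]": 0}
    return f"{API_ROOT}?{urlencode(params)}"


def extract_facets(response: dict) -> tuple[int, dict]:
    """Return (number of records, {facet name: [(title, count), ...]})."""
    meta = response.get("meta", {})
    facets = {}
    for name, value in meta.items():
        # Facets are assumed to be lists; scalar entries such as
        # total and totalPages are skipped.
        if isinstance(value, list):
            facets[name] = [(v.get("title", v.get("id")), v["count"]) for v in value]
    return meta.get("total", 0), facets


# Hand-written sample shaped like the assumed response (not live data).
sample = {
    "meta": {
        "total": 162101,
        "totalPages": 0,
        "licenses": [
            {"id": "cc0-1.0", "title": "CC0-1.0", "count": 62339},
            {"id": "cc-by-4.0", "title": "CC-BY-4.0", "count": 327},
            {"id": "cc-by-3.0", "title": "CC-BY-3.0", "count": 1},
        ],
    }
}

records, facets = extract_facets(sample)
print(facet_query_url("dryad.dryad"))
print(records, facets["licenses"])
```

Separating the URL construction from the parsing keeps the parsing testable without network access; any HTTP client could fetch the URL and pass the decoded JSON to extract_facets.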

Facets and Completeness

Many useful repository measures focus on completeness of the metadata, i.e., the portion of records in the repository that include some metadata element. The DataCite facet data can provide some insight into completeness, but we must keep in mind that the facet data are limited to the top ten values for most facets (except for published and resourceTypes, which can have more than ten values).

The relationship between facet number (the number of facets), facet totals (the total of the facet value counts), and completeness is illustrated schematically in Figure 1 and described here using terminology for facet statistics defined in Table A2:

  • If the metadata element for a particular facet does not exist in the repository, the facet is not included in the DataCite results, i.e., facet number = 0.

  • If the facet number = 1 and the facet total = the number of records, the metadata element is complete, i.e., it exists in every record, and it always has the same value.

  • If the facet number is less than ten and the facet total is less than the total number of records, the top ten facet values cover all records with values and the facet total provides a measure of completeness for the metadata element (completeness = facet coverage = facet total / number of records).

  • If the facet number = 10, even if the facet total is less than the number of records, completeness cannot be determined because there may be additional values for the facet metadata element that are not included in the total.

  • The facets resourceTypes and published are exceptions to these patterns because they can have facet numbers greater than 10. The facet total is the total number of records with the related metadata element, and completeness = facet coverage. Both of these are mandatory metadata elements in Version 4 of the schema, so completeness is typically 100%.
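
The rules above can be sketched as a small Python function. This is only an illustration of the logic in the list: the function name, the argument names, and the use of None for "cannot be determined" are assumptions made for this sketch, and the example numbers come from Table A3.

```python
def completeness(facet, number, total, records,
                 limit=10, unlimited=("resourceTypes", "published")):
    """Return completeness as a fraction, or None if it cannot be determined.

    facet   -- facet name
    number  -- facet number (how many facet values were returned)
    total   -- facet total (sum of the facet value counts)
    records -- number of records in the repository
    """
    if number == 0:
        # The metadata element does not exist in the repository.
        return 0.0
    if facet in unlimited:
        # These facets are not truncated, so the total covers every value.
        return total / records
    if number < limit:
        # Fewer values than the limit: the list is complete, so
        # completeness = facet coverage = facet total / number of records.
        return total / records
    # The list may be truncated; values beyond the limit are not counted.
    return None


records = 162101  # Dryad record count from Table A3
print(completeness("states", 1, 162101, records))      # 1.0
print(completeness("subjects", 10, 43624, records))    # None
print(round(completeness("licenses", 3, 62667, records), 2))      # 0.39
print(round(completeness("published", 16, 161153, records), 2))   # 0.99
```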

Figure 1. The relationship between the number of facets (facet number), the number of records with facets (facet total) and completeness varies with facet number and the facet total.

Dryad Facets

The Dryad Data Repository has a well-established history as a curated collection of datasets connected to research papers and as an “activist” repository committed to continuous improvement. Dryad has been using DataCite for over a decade. The Dryad facets shown in Table A3 provide examples of the features described above and shown in Figure 1. Those examples are described in Table 1.

Type

Examples

facet number = 1 and
facet total = number of records.

In these cases, the repository content is complete.

Three Dryad facets are in this category:

  1. states gives the number of publicly available records in the repository. The records in any DataCite query result are found by the query, so the states facet is always “Findable”.
  2. providers gives the provider identifiers of records in the repository. Most repositories have only one provider and, in this case, the single provider = DRYAD.
  3. clients gives the identifier of the repository, termed client in DataCite. Repositories have only one client-id, and in this case, client = dryad.dryad.

All three of these facets have only one value and a total of 162,101. They also have coverage = 100%. The metadata for these facets are generated by the DataCite system, i.e., they are not included in the DataCite metadata schema.

facet number < 10 and
facet total < number of records.

In these cases, the repository completeness is equal to the facet coverage.

Four Dryad facets are in this category:

  1. schemaVersions gives the major schema versions, i.e., 2, 3, or 4, found in the repository. The current version of the schema is 4.6 and Schema 4 is the most common value of this facet. Repositories that have been using DataCite since before September 2016, when Version 4.0 of the schema was released, have some records in earlier versions, even Version 2!
  2. linkChecksStatus gives counts of HTTP response codes generated by the DataCite Link Checker Service, which periodically checks a random sampling of DOIs to verify that they still resolve to a valid URL. Values not equal to 200 indicate an error. DOIs with specific error values can be found using the DataCite API.
  3. licenses gives the licenses that are included in the repository metadata. In this case, 39% of the records in Dryad have one of three licenses, with CC0-1.0 being the most common, occurring in 38% of the records.
  4. resourceTypes gives the resourceTypeGeneral values observed in the repository. Dryad is dominated by Datasets, with 161,912 records, and includes fewer than 20 records of several other types.

Facet number = 10.

In these cases, there is no information about completeness, but the facet values may provide interesting repository insights.

Nine Dryad facets are in this category: views, downloads, prefixes, registered, created, subjects, citations, fieldsOfScience, and affiliations. The published facet is also listed in Table A3, but it has 16 values, so it is one of the exceptions that can exceed ten values and its coverage (99%) is a measure of completeness.

Note that two of these facets, views and downloads, have totals larger than the number of records (coverage > 100%), which reflects users accessing datasets in the repository. Also note that it is unusual for a single repository to support multiple prefixes, ten or more in this case. This reflects repository subsets created for some reason, perhaps separate repository members.

Table 1. Facet examples from Dryad facets shown in Table A3.

Facet Reports

As suggested above, these facet data can provide insights into repository characteristics and behaviors that may be interesting or useful for repository management and improvement. In the Dryad case, there are three such insights.

First, the views and downloads facets, with over ten million and one million occurrences, respectively, demonstrate that Dryad is achieving its goals of providing access to and encouraging re-use of the datasets in the repository. This is also consistent with the observation that Dryad is among the most commonly used repositories for datasets funded by NSF, USGS, and USAID (Habermann et al., 2023).

Second, the schemaVersions facet indicates that Dryad includes over 50,000 records that use Version 2 or 3 of the DataCite metadata schema. These older versions have limited support for many FAIR DataCite metadata elements and, as of January 2025, they are no longer accepted for the creation of new records. For example, records in these versions must be updated to Version 4 before they can include identifiers for research organizations, licenses, and funders, or several kinds of relations and resourceTypes that are available only in the more recent versions.

Third, there are nearly 100,000 records in the Dryad repository that do not include license information in their metadata. This is unexpected because Dryad does not accept any files with licensing terms that are incompatible with the Creative Commons Zero waiver (CC0), which indicates that the data are in the public domain, and recommends that license as a good data practice.
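
The counts behind the second and third insights follow directly from the facet values in Table A3. The short Python sketch below reproduces that arithmetic; the variable names are illustrative only.

```python
records = 162101  # Dryad record count (total of the states facet, Table A3)

# schemaVersions facet values from Table A3
schema_counts = {"4": 106666, "3": 52279, "2.2": 2427}

# licenses facet total from Table A3
license_total = 62667

# Records still using Version 2 or 3 of the schema
older = sum(n for version, n in schema_counts.items() if not version.startswith("4"))

# Records with no license in the facet results
unlicensed = records - license_total

print(older)       # 54706
print(unlicensed)  # 99434
print(f"{older / records:.0%} of records use an older schema version")
print(f"{unlicensed / records:.0%} of records have no license metadata")
```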

If you are a repository manager and would like to explore a complimentary facet report for your DataCite repository, please contact me at ted@metadatagamechangers.com. More detailed reports focused on Identifier Connectivity and FAIR DataCite Metadata are also available.

References

Habermann, T. (2022). DataCite Facets: Understanding DataCite Usage. Front Matter. https://doi.org/10.59350/wwv9g-31p18

Habermann, T., Jones, J., Ratner, H., & Packer, T. (2023). INFORMATE: Where Are the Data? Front Matter. https://doi.org/10.54900/vnevh-vaw22

Table A1. DataCite API Facets

| Facet | Description |
| --- | --- |
| affiliations | The affiliations facet might better be named affiliationIdentifiers as it gives data about RORs used as affiliation identifiers. |
| certificates | The certificates facet shows the repository certificates (e.g., CoreTrustSeal). |
| citations | The citations column shows the number of citations to items in the repository each year. |
| clients | The clients column shows the number of clients included in the query result. Clients are essentially repositories managed by a provider. The repository id combines the provider and the client separated by a '.', i.e., id = provider.client. The queries used to create these facet reports are repository queries, so they only have one client. |
| created | The created column shows the number of DOIs created each year. |
| downloads | The downloads column shows the number of downloads of items in the repository each year. |
| fieldsOfScience | The Fields of Science are a standard list of high-level scientific domains or fields developed by the OECD. They are used by DataCite to provide some standardization in the subject fields (which are free text). FOS keywords are written as "FOS: value" in order to identify them and include them in this facet. |
| licenses | The licenses column shows the licenses used in the repository and the number of occurrences of each. |
| linkChecksStatus | The linkChecksStatus column shows the status of the landing pages of DOIs in the search (when and if last checked) with counts. |
| prefixes | DOI prefixes are the numbers, starting with '10.', that occur before the first slash in a DOI. Most repositories include DOIs with the same prefix, although there are some large repositories that use prefixes to group DOIs within the repository. These have more than one prefix. |
| providers | The providers column shows the number of providers included in the query result. Providers are DataCite members and may have more than one repository. The first part of the repository id, i.e., before the '.', is an abbreviation for the provider. The queries used to create these facet reports are repository queries, so they only have one provider. |
| published | The published column shows the number of resources published each year. Resource publication dates can be different from DOI creation or registration dates, so it is not unusual for the published facet to have more values (years) than the created or registered facets. |
| registered | The registered column shows the number of DOIs registered each year. |
| resourceTypes | The DataCite metadata schema supports DOIs for many types of resources. The resourceTypes facet shows the number of occurrences of each resource type. The common column shows the most common resource type in the repository. It is not unusual for a repository to focus on a small number of resource types, even a single resource type. In these cases the number column is 1 and the values column shows only one value with a count that matches the number of records in the repository. |
| schemaVersions | The DataCite metadata schema has a number of versions, with 4.6 being the most recent version. The schemaVersions column lists schema versions present in the repository. All versions earlier than 4 are deprecated, so if one of those is present, it is flagged as a warning. |
| states | DataCite DOIs can exist in several different states (draft, registered, or findable), and all records have a state. Most records are in the findable state, i.e., they are available through the API and can be found. In those cases, the number of states is 1 and the total is the number of records in the repository. |
| subjects | The DataCite metadata schema defines subjects as free text keywords describing topics relevant to DOIs. The subjects for a repository provide an overview of the domains covered by the repository. |
| views | The views column shows the number of views of items in the repository each year. |

Table A2. Facet Statistics

| Statistic | Description |
| --- | --- |
| number | The number of facet values (between 1 and 10 for most facets) |
| max | The number of occurrences of the most common facet value, <= the number of records |
| common | The most common facet value |
| total | The total number of resources in the top 10, the sum of the facet values |
| homogeneity | An indicator of homogeneity of the facet: maximum count / total count (0.1 = uniform, 1.0 = single item) |
| coverage | The % of all records covered by the top 10 (numbers close to 100% are good) |
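
As a rough sketch of how most of these statistics could be derived from one facet's list of values, consider the Python function below. The function name and the (value, count) input format are assumptions made for illustration. Homogeneity is left out because the "total count" in its definition could be read either as the facet total or as the number of records. The example uses the Dryad licenses values from Table A3.

```python
def facet_statistics(values, records):
    """Compute facet statistics from a list of (value, count) pairs.

    values  -- facet values as returned for one facet, e.g. [("CC0-1.0", 62339), ...]
    records -- number of records in the repository
    """
    total = sum(count for _, count in values)
    if values:
        # The most common value is the pair with the largest count.
        common, maximum = max(values, key=lambda pair: pair[1])
    else:
        common, maximum = None, 0
    return {
        "number": len(values),        # how many facet values were returned
        "max": maximum,               # count of the most common value
        "common": common,             # the most common value itself
        "total": total,               # sum of the facet value counts
        "coverage": total / records,  # share of all records covered
    }


licenses = [("CC0-1.0", 62339), ("CC-BY-4.0", 327), ("CC-BY-3.0", 1)]
stats = facet_statistics(licenses, 162101)
print(stats["number"], stats["total"], stats["common"], stats["max"])
print(f"{stats['coverage']:.0%}")  # 39%
```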

Table A3. Dryad Facets

| repository_id | facet | number | total | common | max | HI | coverage | values |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dryad.dryad | states | 1 | 162101 | findable | 162101 | 100% | 100% | Findable (162101) |
| dryad.dryad | providers | 1 | 162101 | dryad | 162101 | 100% | 100% | Dryad (162101) |
| dryad.dryad | clients | 1 | 162101 | DRYAD | 162101 | 100% | 100% | DRYAD (162101) |
| dryad.dryad | schemaVersions | 3 | 161372 | 4 | 106666 | 66% | 100% | Schema 4 (106666), Schema 3 (52279), Schema 2.2 (2427) |
| dryad.dryad | licenses | 3 | 62667 | cc0-1.0 | 62339 | 38% | 39% | CC0-1.0 (62339), CC-BY-4.0 (327), CC-BY-3.0 (1) |
| dryad.dryad | linkChecksStatus | 3 | 1635 | 200 | 1569 | 1% | 1% | 200 (1569), 404 (57), 503 (9) |
| dryad.dryad | resourceTypes | 5 | 161931 | dataset | 161912 | 100% | 100% | Dataset (161912), Other (8), Collection (4), Software (4), Text (3) |
| dryad.dryad | views | 10 | 10114895 | 2019 | 2285313 | 1410% | 6240% | 2026 (883), 2025 (4554), 2024 (91693), 2023 (541418), 2022 (854837), 2021 (1337200), 2020 (1665138), 2019 (2285313), 2018 (1659594), 2017 (1674265) |
| dryad.dryad | downloads | 10 | 1445987 | 2019 | 328230 | 202% | 892% | 2026 (31), 2025 (1782), 2024 (31614), 2023 (78732), 2022 (113245), 2021 (167081), 2020 (242077), 2019 (328230), 2018 (240715), 2017 (242480) |
| dryad.dryad | prefixes | 10 | 161814 | 10.5061 | 158953 | 98% | 100% | 10.5061 (158953), 10.25338 (646), 10.6078 (475), 10.7280 (345), 10.7272 (283), 10.25349 (257), 10.15146 (233), 10.5068 (224), 10.7291 (224), 10.6086 (174) |
| dryad.dryad | registered | 10 | 112771 | 2018 | 22174 | 14% | 70% | 2025 (464), 2024 (5766), 2023 (6332), 2022 (6154), 2021 (6706), 2020 (6816), 2019 (18497), 2018 (22174), 2017 (20762), 2016 (19100) |
| dryad.dryad | created | 10 | 112770 | 2018 | 22174 | 14% | 70% | 2025 (465), 2024 (5766), 2023 (6332), 2022 (6154), 2021 (6706), 2020 (6816), 2019 (18497), 2018 (22174), 2017 (20760), 2016 (19100) |
| dryad.dryad | subjects | 10 | 43624 | FOS: Biological sciences | 14072 | 9% | 27% | Fos: Biological Sciences (14072), Holocene (9171), North America (3369), Adaptation (3106), Population Genetics Empirical (2635), Europe (2446), Pleistocene (2393), Speciation (2220), Australia (2217), Fos: Earth And Related Environmental Sciences (1995) |
| dryad.dryad | citations | 10 | 38479 | 2019 | 6982 | 4% | 24% | 2026 (2), 2025 (261), 2024 (2461), 2023 (3764), 2022 (4475), 2021 (5165), 2020 (5144), 2019 (6982), 2018 (5466), 2017 (4759) |
| dryad.dryad | fieldsOfScience | 10 | 20521 | biological_sciences | 14072 | 9% | 13% | Biological sciences (14072), Earth and related environmental sciences (1995), Natural sciences (1814), Agricultural sciences (629), Medical and health sciences (523), Health sciences (356), Computer and information sciences (326), Sociology (312), Physical sciences (260), Other natural sciences (234) |
| dryad.dryad | affiliations | 10 | 8582 | ror.org/05rrcem69 | 1194 | 1% | 5% | University of California; Davis (1194), University of California; Berkeley (973), University of Oxford (865), University of British Columbia (849), French National Centre for Scientific Research (845), Cornell University (840), University of Florida (792), Chinese Academy of Sciences (773), Dryad Digital Repository (729), University of Washington (722) |
| dryad.dryad | published | 16 | 161153 | 2018 | 21753 | 13% | 99% | 2025 (463), 2024 (4159), 2023 (6492), 2022 (5961), 2021 (7046), 2020 (7942), 2019 (21617), 2018 (21753), 2017 (20420), 2016 (18919), 2015 (17778), 2014 (11378), 2013 (8739), 2012 (5059), 2011 (2707), 2010 (720) |