DataCite Facets: Understanding DataCite Usage

Introduction

The DataCite Metadata Schema has evolved considerably over more than a decade and now includes a variety of metadata elements, resource types, related resources, and contributor types in millions of metadata records from over 2000 repositories. The DataCite Metadata Working Group has overseen this evolution and works with DataCite members and the DataCite Board to chart the path forward for the metadata schema. Understanding how DataCite metadata is currently being used provides critical background for this group. In this blog, I describe how the DataCite metadata facets help me improve my understanding of DataCite metadata usage, and I introduce some software that can help you answer your own questions about that usage.

Facets

A facet is a metadata element, usually from a controlled list, that provides counts of records in a query result with particular values for the metadata element. Figure 1 shows how two facets are used in DataCite to provide information about search results. The “Registration Year” facet shows how many results of this “Landslide” query were registered in each year between 2013 and 2022, e.g. 1436 were registered in 2021, and the “Resource Types” facet shows that the majority of the resources found in the search (3770) were datasets. Scrolling down in the search result shows the organizations with identifiers that created these resources. These facets provide information about the history, resource type, and repositories in the DataCite community that hold information about landslides. How can they help us understand the bigger picture of metadata usage?

Figure 1. DataCite search results for “Landslide” with facet results on the right.

The DataCite JSON Response includes data for 18 facets for each query done using the DataCite API. For example, this landslide query includes the data for the Resource Types facet in a data structure that looks like this for the first two rows:

"resourceTypes": [
    {
        "id": "dataset",
        "title": "Dataset",
        "count": 3770
    },
    {
        "id": "text",
        "title": "Text",
        "count": 1589
    },

This indicates that the full query result contains 3,770 Datasets and 1,589 Text resources, as shown in the sidebar of Figure 1.
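A minimal sketch of reading one facet out of a parsed response is shown here. It assumes that the facets sit under a top-level "meta" key of the JSON response, which is an assumption about the response layout rather than something stated above. The helper name facet_counts is hypothetical, and the sample response is trimmed to the two rows shown above.

```python
import json

# Trimmed sample response. The "meta" wrapper is an assumption about where
# the API places facet data; the two rows are the ones shown above.
SAMPLE_RESPONSE = json.loads("""
{
  "meta": {
    "resourceTypes": [
      {"id": "dataset", "title": "Dataset", "count": 3770},
      {"id": "text", "title": "Text", "count": 1589}
    ]
  }
}
""")


def facet_counts(response, facet):
    """Return {title: count} for one facet of a parsed response (hypothetical helper)."""
    rows = response.get("meta", {}).get(facet, [])
    return {row["title"]: row["count"] for row in rows}


resource_types = facet_counts(SAMPLE_RESPONSE, "resourceTypes")
for title, count in resource_types.items():
    print(f"{title}: {count:,}")
```

A facet that is missing from the response simply yields an empty dictionary, which keeps later summary code free of special cases.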

Keep in mind that these facets are calculated for every query done using the API. For example, a query for all resources with resourceTypeGeneral = “OutputManagementPlan” gives the facet results for all OutputManagementPlans in DataCite, i.e. when they were registered, who registered them, what affiliation identifiers they contain, and many others (there are 18 facets calculated for each query). One facet gives the counts of the resources in the ten repositories that have the most OutputManagementPlans. Only the top ten values of each facet are retrieved; fortunately, this covers almost everything in many cases and gives a useful overview in the others. I used this result to find DataCite members that were using OutputManagementPlans so I could understand how they were using them.
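As a rough sketch, a query like the OutputManagementPlan example could be built as a URL as follows. The endpoint, the "query" and "page[size]" parameter names, and the field path types.resourceTypeGeneral are all assumptions that should be checked against the DataCite REST API documentation. The helper name facet_query_url is hypothetical. The code only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the DataCite REST API documentation.
API_ENDPOINT = "https://api.datacite.org/dois"


def facet_query_url(resource_type_general):
    """Build a query URL for one resourceTypeGeneral value (hypothetical helper)."""
    params = {
        # Search on the resourceTypeGeneral field (assumed field path).
        "query": f"types.resourceTypeGeneral:{resource_type_general}",
        # The facets describe the whole result set, so one record is enough.
        "page[size]": 1,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


url = facet_query_url("OutputManagementPlan")
print(url)
```

Requesting a single record keeps the response small, because the facet counts do not depend on how many records are returned on the page.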

As described above, the facet results are returned in the query as a list of dictionaries. I find it helpful to calculate a few summary statistics from this list. For example, the “clients” facet for PhysicalObjects gives us the ten repositories with the most PhysicalObjects:

"clients": [
    {"id": "fao.itpgrfa", "title": "International Treaty on Plant Genetic Resources for Food and Agriculture", "count": 1084243},
    {"id": "ipk.gbis", "title": "Genebank Information System of the IPK Gatersleben", "count": 208740},
    {"id": "inist.inra", "title": "Data INRAE", "count": 67518},
    {"id": "tcd.digcolls", "title": "Digital Collections", "count": 10299},
    {"id": "inist.humanum", "title": "NAKALA", "count": 9286},
    {"id": "ubc.oc", "title": "Open Collections", "count": 3720},
    {"id": "subgoe.vzg", "title": "Verbundzentrale des GBV", "count": 3142},
    {"id": "inist.inra", "title": "Institut national de recherche pour l’agriculture, l’alimentation et l’environnement", "count": 1435},
    {"id": "inist.ulille", "title": "Université de Lille", "count": 1200},
    {"id": "cern.zenodo", "title": "Zenodo", "count": 986}
],

and I calculate the following statistics:

Statistic | Definition | Value
number | The number of clients (up to 10) | 10
max | The maximum number of resources for any client | 1084243
common | The client with the maximum number of resources | fao.itpgrfa
total | The total number of resources across the listed clients | 1390569
homogeneity | An indicator of homogeneity of the list (0.1 = uniform, 1.0 = single item) | 78%

Table 1. Facet statistics.
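The statistics in Table 1 can be sketched in a few lines. The homogeneity definition used here, the largest count divided by the total, is an assumption: it reproduces the 78% in Table 1 and the stated end points (0.1 for ten equal values, 1.0 for a single value), but it is not spelled out above. The helper name facet_statistics is hypothetical.

```python
# The "clients" facet for PhysicalObjects, reduced to (id, count) pairs.
# The listing above repeats the id "inist.inra"; it is kept as given.
CLIENTS = [
    ("fao.itpgrfa", 1084243),
    ("ipk.gbis", 208740),
    ("inist.inra", 67518),
    ("tcd.digcolls", 10299),
    ("inist.humanum", 9286),
    ("ubc.oc", 3720),
    ("subgoe.vzg", 3142),
    ("inist.inra", 1435),
    ("inist.ulille", 1200),
    ("cern.zenodo", 986),
]


def facet_statistics(rows):
    """Summarize a facet given as (id, count) pairs (hypothetical helper)."""
    if not rows:
        return {"number": 0, "max": 0, "common": None, "total": 0, "homogeneity": 0.0}
    total = sum(count for _, count in rows)
    common_id, max_count = max(rows, key=lambda row: row[1])
    return {
        "number": len(rows),
        "max": max_count,
        "common": common_id,
        "total": total,
        # Assumed definition: share of the total held by the largest value.
        "homogeneity": max_count / total if total else 0.0,
    }


stats = facet_statistics(CLIENTS)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Because only the top ten values of each facet are available, the total here is the total of the listed values, not necessarily of every record that matches the query.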

Answering Questions About DataCite Metadata

So, how can these facet data be used to answer questions about DataCite metadata? One interesting question is “How do DataCite metadata evolve?”. An interesting test case for this question occurred with the introduction of Version 4.4 of the DataCite Schema during early 2021, just about a year ago. That version of the schema included thirteen new resource types in the resourceTypeGeneral codelist, a required field (Figure 2, from Beyond data: sharing related research outputs to make data reusable).

Figure 2. History of DataCite resource types. Note large increase during 2021.

The “registered” facet data for these resource types, shown in Table 2, reveal several interesting patterns:

  1. The new resource types have been used over 1.3 million times (sum of NumberOfRecords).

  2. Six of the thirteen types were assigned to items registered in ten different years (registered_number = 10, the most the facet returns), indicating that repositories updated previously registered DOIs with new types (an important prerequisite for metadata evolution).

  3. Preprint is by far the most used of the new types, accounting for 71% of the items (green cell in NumberOfRecords and registered_total).

  4. The vast majority of preprint DOIs (929,124) were registered during 2022 (registered_max and registered).

  5. Repositories have already registered more items with five new types (ConferencePaper, Dissertation, JournalArticle, Preprint, Standard) during 2022 than during any other year (registered_common).

DataCite Facet Summary

Item list: Book Report Journal Preprint Standard PeerReview BookChapter Dissertation JournalArticle ConferencePaper ConferenceProceeding ComputationalNotebook OutputManagementPlan
Facet list: registered
Id DateTime NumberOfRecords registered_number registered_max registered_common registered_total registered_HI registered
Book 20220518_10 15217 10 6295 2021 14640 43% 2022 (4268), 2021 (6295), 2020 (596), 2019 (455), 2018 (1098), 2017 (904), 2016 (623), 2015 (370), 2014 (9), 2013 (22)
BookChapter 20220518_10 10933 10 7456 2021 10933 68% 2022 (2663), 2021 (7456), 2020 (64), 2019 (31), 2018 (28), 2017 (100), 2016 (200), 2015 (97), 2012 (289), 2011 (5)
ComputationalNotebook 20220518_10 10 3 6 2021 10 60% 2022 (3), 2021 (6), 2019 (1)
ConferencePaper 20220518_10 23110 9 12559 2022 23108 54% 2022 (12559), 2021 (10174), 2020 (166), 2019 (69), 2018 (20), 2017 (36), 2016 (8), 2015 (7), 2013 (69)
ConferenceProceeding 20220518_10 854 10 288 2020 745 39% 2022 (128), 2021 (144), 2020 (288), 2019 (11), 2018 (10), 2017 (36), 2016 (89), 2015 (28), 2014 (6), 2013 (5)
Dissertation 20220518_10 62684 10 31816 2022 62673 51% 2022 (31816), 2021 (14607), 2020 (16200), 2019 (11), 2018 (10), 2017 (16), 2016 (4), 2015 (2), 2014 (5), 2013 (2)
Journal 20220518_10 372 3 293 2021 372 79% 2022 (71), 2021 (293), 2020 (8)
JournalArticle 20220518_10 179628 10 99959 2022 177300 56% 2022 (99959), 2021 (66609), 2020 (979), 2019 (574), 2018 (230), 2017 (1949), 2016 (6404), 2015 (14), 2014 (580), 2012 (2)
OutputManagementPlan 20220518_10 961 4 524 2021 961 55% 2022 (239), 2021 (524), 2020 (135), 2019 (63)
PeerReview 20220518_10 291 2 270 2021 291 93% 2022 (21), 2021 (270)
Preprint 20220518_10 930504 5 929124 2022 930493 100% 2022 (929124), 2021 (1299), 2020 (40), 2019 (17), 2018 (13)
Report 20220518_10 82283 10 10043 2021 59391 17% 2022 (5969), 2021 (10043), 2020 (4487), 2019 (5101), 2018 (5132), 2017 (6576), 2016 (8353), 2015 (5125), 2014 (3926), 2013 (4679)
Standard 20220518_10 2146 7 1337 2022 2146 62% 2022 (1337), 2021 (578), 2020 (6), 2019 (213), 2018 (3), 2017 (7), 2016 (2)

Report created 20220518_10 by retrieveDataCiteFacets from Metadata Game Changers

Table 2. Registered facet (i.e. year of registration) for resource types introduced during the last year.
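The arithmetic behind several of the observations listed before Table 2 can be checked with a short sketch. The values are copied from the NumberOfRecords and registered_common columns of Table 2. The variable names are illustrative only.

```python
# (NumberOfRecords, registered_common) for each new type, copied from Table 2.
TABLE_2 = {
    "Book": (15217, 2021),
    "BookChapter": (10933, 2021),
    "ComputationalNotebook": (10, 2021),
    "ConferencePaper": (23110, 2022),
    "ConferenceProceeding": (854, 2020),
    "Dissertation": (62684, 2022),
    "Journal": (372, 2021),
    "JournalArticle": (179628, 2022),
    "OutputManagementPlan": (961, 2021),
    "PeerReview": (291, 2021),
    "Preprint": (930504, 2022),
    "Report": (82283, 2021),
    "Standard": (2146, 2022),
}

# Observation 1: total uses of the new types.
total_records = sum(records for records, _ in TABLE_2.values())

# Observation 3: share of all new-type records that are Preprints.
preprint_share = TABLE_2["Preprint"][0] / total_records

# Observation 5: types whose most common registration year is 2022.
peak_2022 = sorted(name for name, (_, year) in TABLE_2.items() if year == 2022)

print(f"Total uses of the new types: {total_records:,}")
print(f"Preprint share: {preprint_share:.0%}")
print("Most common registration year is 2022:", ", ".join(peak_2022))
```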

These data immediately lead to a second question: “Which repositories are using these new resource types?”. That question can be answered using the “clients” facet data shown in Table 3. It is immediately clear that arXiv (arxiv.content) dominates the usage of Preprint in DataCite with 927,777 of 930,504 registered items. Such concentration is not unusual. The homogeneity index (HI) column shows that a single repository is responsible for over 80% of the usage of five of the thirteen new types (Preprint, Standard, JournalArticle, ConferencePaper, and Report).

DataCite Facet Summary

Item list: Book Report Journal Preprint Standard PeerReview BookChapter Dissertation JournalArticle ConferencePaper ConferenceProceeding ComputationalNotebook OutputManagementPlan
Facet list: clients
Id DateTime NumberOfRecords clients_number clients_max clients_common clients_total clients_HI clients
Book 20220518_10 15217 10 6433 tib.tib 14786 44% TIB Hannover (6433), Zenodo (5539), peDOCS (2065), Università degli Studi di Napoli Federico II (229), Presses Universitaires Savoie Mont Blanc (222), RIA-UA (100), Universidade Católica Portuguesa (71), Libera Università di Bolzano (54), Escuela Internacional de Negocios y Desarrollo Empresarial de Colombia - EIDEC (43), innsbruck university press (30)
BookChapter 20220518_10 10933 10 5017 mjvh.pedocs 10339 49% peDOCS (5017), Zenodo (3886), Alfred Wegener Institute (346), TIB Hannover (307), innsbruck university press (215), Libera Università di Bolzano (177), Universidade Católica Portuguesa (134), Polska Platforma Medyczna (125), Wydawnictwo Politechniki Łódzkiej (69), DIGITUMA (63)
ComputationalNotebook 20220518_10 10 4 4 jbru.cist 10 40% Fédération de recherche CIST - Collège international des sciences territoriales (4), Huma-Num (3), University of Luxembourg (2), DataCite (1)
ConferencePaper 20220518_10 23112 10 18951 cern.zenodo 22894 83% Zenodo (18951), German Medical Science (2839), University of New South Wales (421), TIB Hannover (201), Eurographics (144), Deutscher Verband für Materialforschung und -prüfung e.V. (126), University of Massachusetts (UMass) Amherst (71), pub H-BRS - Publikationsserver der Hochschule Bonn-Rhein-Sieg (50), Aalto University (46), TU Delft Research Repository (45)
ConferenceProceeding 20220518_10 854 10 583 tib.tib 799 73% TIB Hannover (583), Università degli Studi di Pisa (61), University of Texas Libraries (32), Leibniz Institute of Ecological Urban and Regional Development (29), Université de technologie de Compiègne (24), Instituto Politécnico do Porto (19), Data INRAE (15), UC Santa Barbara (14), Repositório Aberto da Universidade do Porto (11), Washington State University (11)
Dissertation 20220518_10 62685 10 22116 unsw.repo 61537 36% University of New South Wales (22116), Universitätsbibliothek der LMU (17934), Repositorio Institucional E-docUR (6942), Aston Publications Explorer (4804), STAX (4650), University of Massachusetts (UMass) Amherst (2720), Polska Platforma Medyczna (1156), Washington University in St. Louis Libraries (468), UKnowledge (381), TU Delft Research Repository (366)
Journal 20220518_10 372 10 109 zjrp.depp 285 38% Direction de l’évaluation; de la prospective et de la performance (109), Royal Botanic Gardens; Kew (92), Lusíada - Repositório das Universidades Lusíada (13), iMex. México Interdisciplinario / Interdisciplinary Mexico (13), Aspekty Muzyki (12), Università del Piemonte orientale “Amedeo Avogadro” (11), Universitätsbibliothek Eichstätt-Ingolstadt (9), James Cook University (9), mdwRepository (9), Université Paul-Valéry Montpellier 3 (8)
JournalArticle 20220518_10 179636 10 148742 cern.zenodo 174371 85% Zenodo (148742), peDOCS (14496), Stanford Social Innovation Review (5826), Alfred Wegener Institute (3093), innsbruck university press (649), Entrepôt pour orphelin (420), Aalto University (377), NumeRev (354), Akofena (223), unipub (191)
OutputManagementPlan 20220518_10 961 4 540 cern.zenodo 961 56% Zenodo (540), California Digital Library (418), SOCIB (2), CDI B2Share (1)
PeerReview 20220518_10 291 6 182 inist.humanum 291 63% Huma-Num (182), NAKALA (100), Polskie Towarzystwo Logopedyczne (6), bonndoc (1), Tampa Repository (1), HEE Journal - The Journal of Health; Environment; and Education (1)
Preprint 20220518_10 930504 10 927777 arxiv.content 930490 100% arXiv (927777), Zenodo (2608), NAKALA (46), Otto-von-Guericke-Universität Magdeburg (15), University of Massachusetts (UMass) Amherst (15), Oroboros Instruments (10), ISCPSI - Instituto Superior de Ciências Policiais e Segurança Interna (6), Institut national d'études démographiques (5), Materials Data Repository (4), Washington University in St. Louis Libraries (4)
Report 20220518_10 82283 10 66681 tib.tib 80814 83% TIB Hannover (66681), Zenodo (7895), Alfred Wegener Institute (2568), CL Technical Reports (964), Libra (902), University of New South Wales (587), National Research Council Canada (516), Open Society Foundations (282), RIVM - Rijksinstituut voor Volksgezondheid en Milieu (210), IOC of UNESCO (Intergovernmental Oceanographic Commission) (209)
Standard 20220518_10 2146 6 1848 univie.valep 2146 86% VALEP (1848), National Research Council Canada (294), Royal Botanic Gardens; Kew (1), NASA Space Physics Data Facility (SPDF) (1), Institut national de recherche pour l’agriculture; l’alimentation et l’environnement (1), Central Lancashire Online Knowledge (1)

Report created 20220518_10 by retrieveDataCiteFacets from Metadata Game Changers

Table 3. Clients facet (i.e. repositories using resource type) for resource types introduced during the last year.
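The types dominated by a single repository can be picked out of Table 3 with a simple threshold. This sketch assumes that clients_HI is clients_max divided by clients_total, which matches the percentages in the table but is not stated explicitly. The values are copied from Table 3, and the names are illustrative only.

```python
# (clients_max, clients_total) for each new type, copied from Table 3.
TABLE_3 = {
    "Book": (6433, 14786),
    "BookChapter": (5017, 10339),
    "ComputationalNotebook": (4, 10),
    "ConferencePaper": (18951, 22894),
    "ConferenceProceeding": (583, 799),
    "Dissertation": (22116, 61537),
    "Journal": (109, 285),
    "JournalArticle": (148742, 174371),
    "OutputManagementPlan": (540, 961),
    "PeerReview": (182, 291),
    "Preprint": (927777, 930490),
    "Report": (66681, 80814),
    "Standard": (1848, 2146),
}

THRESHOLD = 0.80  # "over 80%" of usage from a single repository

# Assumed definition of the homogeneity index: largest client / listed total.
dominated = {
    name: largest / total
    for name, (largest, total) in TABLE_3.items()
    if largest / total > THRESHOLD
}

# List the most concentrated types first.
for name, share in sorted(dominated.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {share:.0%}")
```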

Conclusion and Tools

These examples illustrate a few of the many interesting questions that can be addressed using the DataCite facet data available with any DataCite query. Together, the DataCite vocabularies for resourceType, relationType, and contributorType include over eighty items. Finding examples of how these items are currently used in DataCite metadata is an important step towards identifying good practices that can be shared across the community and, in turn, towards consistent use and understanding of metadata.

A tool for retrieving these data is available at https://github.com/Metadata-Game-Changers/DataCiteFacets. The documentation there describes how you can retrieve data for any and all facets and for all DataCite resourceTypes, relationTypes, and contributor types as well as for creator affiliation strings. This is the first open source software released by Metadata Game Changers. I hope you find that software helpful in answering your own questions about DataCite metadata usage!