DataCite Facets: Understanding DataCite Usage
Cite this blog as Habermann, T. (2022). DataCite Facets: Understanding DataCite Usage. Front Matter. https://doi.org/10.59350/wwv9g-31p18
Introduction
The DataCite Metadata Schema has evolved considerably over more than a decade and now includes a variety of metadata elements, resource types, related resources, and contributor types in millions of metadata records from over 2000 repositories. The DataCite Metadata Working Group has overseen this evolution and works with DataCite members and the DataCite Board to chart the path forward for the metadata schema. Understanding how DataCite metadata is currently being used provides critical background for this group. In this blog I describe how the DataCite metadata facets improve my understanding of DataCite metadata usage, and I introduce some software that can help you answer your own questions about that usage.
Facets
A facet is a metadata element, usually from a controlled list, that provides counts of records in a query result with particular values for the metadata element. Figure 1 shows how two facets are used in DataCite to provide information about search results. The “Registration Year” facet shows how many results of this “Landslide” query were registered in each year between 2013 and 2022, e.g. 1436 were registered in 2021, and the “Resource Types” facet shows that the majority of the resources found in the search (3770) were datasets. Scrolling down in the search result shows the organizations with identifiers that created these resources. These facets provide information about the history, resource type, and repositories in the DataCite community that hold information about landslides. How can they help us understand the bigger picture of metadata usage?
The DataCite JSON Response includes data for 18 facets for each query done using the DataCite API. For example, this landslide query includes the data for the Resource Types facet in a data structure that looks like this for the first two rows:
"resourceTypes": [ { "id": "dataset", "title": "Dataset", "count": 3770 }, { "id": "text", "title": "Text", "count": 1589 },
This indicates that the full query result contains 3,770 Datasets and 1,589 Text resources, as shown in the sidebar of Figure 1.
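A minimal sketch of reading such a facet in Python, assuming the response has already been parsed into a dictionary and that the facets sit under a top-level "meta" key. That wrapper is an assumption about the response layout, and the helper name facet_counts is hypothetical:

```python
# Sample shaped like the fragment above; the "meta" wrapper is an assumption.
sample_response = {
    "meta": {
        "resourceTypes": [
            {"id": "dataset", "title": "Dataset", "count": 3770},
            {"id": "text", "title": "Text", "count": 1589},
        ]
    }
}


def facet_counts(response, facet_name):
    """Return {title: count} for one facet, or an empty dict if it is missing."""
    rows = response.get("meta", {}).get(facet_name, [])
    return {row["title"]: row["count"] for row in rows}


counts = facet_counts(sample_response, "resourceTypes")
print(counts)
```

Keeping the lookup tolerant of a missing facet means the same helper can be pointed at any of the facet names without special cases.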
Keep in mind that these facets are calculated for every query done using the API. For example, a query for all resources with resourceTypeGeneral = “OutputManagementPlan” gives the facet results for all OutputManagementPlans in DataCite: when they were registered, who registered them, what affiliation identifiers they contain, and many others (there are 18 facets calculated for each query). One facet gives the counts of the resources in the ten repositories that have the most OutputManagementPlans. Only the top ten values of each facet are retrieved; fortunately, this covers almost everything in many cases and gives a useful overview in the others. I used this result to find DataCite members that were using OutputManagementPlans so I could understand how they were using them.
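A query like this could be built along the following lines. This is a sketch under stated assumptions, not a description of how the API must be called: the endpoint URL, the query field name types.resourceTypeGeneral, the page[size] parameter, and the "meta" key are all assumptions, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the public DataCite REST API.
API_URL = "https://api.datacite.org/dois"


def facet_query_url(resource_type):
    """Build a query URL for all records with one resourceTypeGeneral value."""
    params = {
        # Assumed query field name for resourceTypeGeneral.
        "query": f"types.resourceTypeGeneral:{resource_type}",
        # Only the facets are wanted, so ask for as few records as possible.
        "page[size]": 1,
    }
    return f"{API_URL}?{urlencode(params)}"


def fetch_facets(resource_type):
    """Fetch the query result and return its facet data (needs network access)."""
    with urlopen(facet_query_url(resource_type)) as reply:
        # Facets are assumed to live under the "meta" key of the response.
        return json.load(reply).get("meta", {})


url = facet_query_url("OutputManagementPlan")
print(url)
```

Calling fetch_facets("OutputManagementPlan") would then return the facet lists for that resource type, provided the assumptions above hold.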
As described above, the facet results are returned in the query as a list of dictionaries. I find it helpful to calculate a few summary statistics from this list. For example, the “clients” facet for PhysicalObjects gives us the ten repositories with the most PhysicalObjects:
"clients": [ {"id": "fao.itpgrfa", "title": "International Treaty on Plant Genetic Resources for Food and Agriculture", "count": 1084243}, {"id": "ipk.gbis", "title": "Genebank Information System of the IPK Gatersleben", "count": 208740}, {"id": "inist.inra", "title": "Data INRAE", "count": 67518}, {"id": "tcd.digcolls", "title": "Digital Collections", "count": 10299}, {"id": "inist.humanum", "title": "NAKALA", "count": 9286}, {"id": "ubc.oc", "title": "Open Collections", "count": 3720}, {"id": "subgoe.vzg", "title": "Verbundzentrale des GBV", "count": 3142}, {"id": "inist.inra", "title": "Institut national de recherche pour l’agriculture, l’alimentation et l’environnement", "count": 1435}, {"id": "inist.ulille", "title": "Université de Lille", "count": 1200}, {"id": "cern.zenodo", "title": "Zenodo", "count": 986} ],
and I calculate the following statistics:
Statistic | Definition | Value |
---|---|---|
number | The number of clients (up to 10) | 10 |
max | The maximum number of resources for any client | 1084243 |
common | The id of the client with the most resources | fao.itpgrfa |
total | The total number of resources for the listed clients | 1390569 |
homogeneity | An indicator of homogeneity of the list, max / total (0.1 = uniform, 1.0 = single item) | 78% |
Table 1. Facet statistics.
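These statistics can be reproduced from the list of dictionaries with a few lines of Python. The sketch below assumes that homogeneity is the maximum count divided by the total, which is consistent with the 78% reported for this list; the function name facet_statistics is hypothetical and titles are omitted for brevity.

```python
# Ids and counts from the "clients" facet for PhysicalObjects shown above.
clients = [
    {"id": "fao.itpgrfa", "count": 1084243},
    {"id": "ipk.gbis", "count": 208740},
    {"id": "inist.inra", "count": 67518},
    {"id": "tcd.digcolls", "count": 10299},
    {"id": "inist.humanum", "count": 9286},
    {"id": "ubc.oc", "count": 3720},
    {"id": "subgoe.vzg", "count": 3142},
    {"id": "inist.inra", "count": 1435},
    {"id": "inist.ulille", "count": 1200},
    {"id": "cern.zenodo", "count": 986},
]


def facet_statistics(rows):
    """Summarize a non-empty facet list of {"id", "count"} dictionaries."""
    top = max(rows, key=lambda row: row["count"])
    total = sum(row["count"] for row in rows)
    return {
        "number": len(rows),
        "max": top["count"],
        "common": top["id"],
        "total": total,
        # Assumed definition: share of the total held by the largest item.
        "homogeneity": top["count"] / total,
    }


stats = facet_statistics(clients)
print(stats["number"], stats["max"], stats["common"], stats["total"],
      f"{stats['homogeneity']:.0%}")
```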
Answering Questions About DataCite Metadata
So, how can these facet data be used to answer questions about DataCite metadata? One interesting question is “How do DataCite metadata evolve?” A useful test case for this question occurred with the introduction of Version 4.4 of the DataCite Schema during early 2021, just about a year ago. That version of the schema added thirteen new resource types to the resourceTypeGeneral codelist, a required field (Figure 2, from Beyond data: sharing related research outputs to make data reusable).
The “registered” facet data for these resource types, shown in Table 2, reveal several interesting patterns:
The new resource types have been used over 1.3 million times (sum of NumberOfRecords).
Six of the thirteen types were assigned to items registered in ten different years (registered_number = 10, the most that the facet returns), indicating that repositories updated previously registered DOIs with new types (an important prerequisite for metadata evolution).
Preprint is by far the most used of the new types, accounting for 71% of the items (NumberOfRecords and registered_total).
The vast majority of preprint DOIs (929,124) were registered during 2022 (registered_max and registered).
Repositories have already registered more items with five new types (ConferencePaper, Dissertation, JournalArticle, Preprint, Standard) during 2022 than during any other year (registered_common).
DataCite Facet Summary
Item list: Book Report Journal Preprint Standard PeerReview BookChapter Dissertation JournalArticle ConferencePaper ConferenceProceeding ComputationalNotebook OutputManagementPlan
Facet list: registered
Id | DateTime | NumberOfRecords | registered_number | registered_max | registered_common | registered_total | registered_HI | registered |
---|---|---|---|---|---|---|---|---|
Book | 20220518_10 | 15217 | 10 | 6295 | 2021 | 14640 | 43% | 2022 (4268), 2021 (6295), 2020 (596), 2019 (455), 2018 (1098), 2017 (904), 2016 (623), 2015 (370), 2014 (9), 2013 (22) |
BookChapter | 20220518_10 | 10933 | 10 | 7456 | 2021 | 10933 | 68% | 2022 (2663), 2021 (7456), 2020 (64), 2019 (31), 2018 (28), 2017 (100), 2016 (200), 2015 (97), 2012 (289), 2011 (5) |
ComputationalNotebook | 20220518_10 | 10 | 3 | 6 | 2021 | 10 | 60% | 2022 (3), 2021 (6), 2019 (1) |
ConferencePaper | 20220518_10 | 23110 | 9 | 12559 | 2022 | 23108 | 54% | 2022 (12559), 2021 (10174), 2020 (166), 2019 (69), 2018 (20), 2017 (36), 2016 (8), 2015 (7), 2013 (69) |
ConferenceProceeding | 20220518_10 | 854 | 10 | 288 | 2020 | 745 | 39% | 2022 (128), 2021 (144), 2020 (288), 2019 (11), 2018 (10), 2017 (36), 2016 (89), 2015 (28), 2014 (6), 2013 (5) |
Dissertation | 20220518_10 | 62684 | 10 | 31816 | 2022 | 62673 | 51% | 2022 (31816), 2021 (14607), 2020 (16200), 2019 (11), 2018 (10), 2017 (16), 2016 (4), 2015 (2), 2014 (5), 2013 (2) |
Journal | 20220518_10 | 372 | 3 | 293 | 2021 | 372 | 79% | 2022 (71), 2021 (293), 2020 (8) |
JournalArticle | 20220518_10 | 179628 | 10 | 99959 | 2022 | 177300 | 56% | 2022 (99959), 2021 (66609), 2020 (979), 2019 (574), 2018 (230), 2017 (1949), 2016 (6404), 2015 (14), 2014 (580), 2012 (2) |
OutputManagementPlan | 20220518_10 | 961 | 4 | 524 | 2021 | 961 | 55% | 2022 (239), 2021 (524), 2020 (135), 2019 (63) |
PeerReview | 20220518_10 | 291 | 2 | 270 | 2021 | 291 | 93% | 2022 (21), 2021 (270) |
Preprint | 20220518_10 | 930504 | 5 | 929124 | 2022 | 930493 | 100% | 2022 (929124), 2021 (1299), 2020 (40), 2019 (17), 2018 (13) |
Report | 20220518_10 | 82283 | 10 | 10043 | 2021 | 59391 | 17% | 2022 (5969), 2021 (10043), 2020 (4487), 2019 (5101), 2018 (5132), 2017 (6576), 2016 (8353), 2015 (5125), 2014 (3926), 2013 (4679) |
Standard | 20220518_10 | 2146 | 7 | 1337 | 2022 | 2146 | 62% | 2022 (1337), 2021 (578), 2020 (6), 2019 (213), 2018 (3), 2017 (7), 2016 (2) |
Report created 20220518_10 by retrieveDataCiteFacets from Metadata Game Changers
Table 2. Registered facet (i.e. year of registration) for resource types introduced during the last year.
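The total usage and the Preprint share quoted above can be checked directly from the NumberOfRecords column of Table 2. This is a small arithmetic sketch; the variable names are hypothetical.

```python
# NumberOfRecords column of Table 2.
number_of_records = {
    "Book": 15217,
    "BookChapter": 10933,
    "ComputationalNotebook": 10,
    "ConferencePaper": 23110,
    "ConferenceProceeding": 854,
    "Dissertation": 62684,
    "Journal": 372,
    "JournalArticle": 179628,
    "OutputManagementPlan": 961,
    "PeerReview": 291,
    "Preprint": 930504,
    "Report": 82283,
    "Standard": 2146,
}

total_records = sum(number_of_records.values())
most_used = max(number_of_records, key=number_of_records.get)
share = number_of_records[most_used] / total_records

print(total_records, most_used, f"{share:.0%}")
```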
These data immediately lead to a second question: “Which repositories are using these new resource types?”. That question can be answered using the “clients” facet data shown in Table 3. It is immediately clear that arXiv (arxiv.content) dominates the usage of Preprint in DataCite with 927,777 of 930,504 registered items. Similar behavior is not unusual. The homogeneity index (HI) column shows that a single repository is responsible for over 80% of the usage of five of the thirteen new types (Preprint, Standard, JournalArticle, ConferencePaper, and Report).
DataCite Facet Summary
Item list: Book Report Journal Preprint Standard PeerReview BookChapter Dissertation JournalArticle ConferencePaper ConferenceProceeding ComputationalNotebook OutputManagementPlan
Facet list: clients
Id | DateTime | NumberOfRecords | clients_number | clients_max | clients_common | clients_total | clients_HI | clients |
---|---|---|---|---|---|---|---|---|
Book | 20220518_10 | 15217 | 10 | 6433 | tib.tib | 14786 | 44% | TIB Hannover (6433), Zenodo (5539), peDOCS (2065), Università degli Studi di Napoli Federico II (229), Presses Universitaires Savoie Mont Blanc (222), RIA-UA (100), Universidade Católica Portuguesa (71), Libera Università di Bolzano (54), Escuela Internacional de Negocios y Desarrollo Empresarial de Colombia - EIDEC (43), innsbruck university press (30) |
BookChapter | 20220518_10 | 10933 | 10 | 5017 | mjvh.pedocs | 10339 | 49% | peDOCS (5017), Zenodo (3886), Alfred Wegener Institute (346), TIB Hannover (307), innsbruck university press (215), Libera Università di Bolzano (177), Universidade Católica Portuguesa (134), Polska Platforma Medyczna (125), Wydawnictwo Politechniki Łódzkiej (69), DIGITUMA (63) |
ComputationalNotebook | 20220518_10 | 10 | 4 | 4 | jbru.cist | 10 | 40% | Fédération de recherche CIST - Collège international des sciences territoriales (4), Huma-Num (3), University of Luxembourg (2), DataCite (1) |
ConferencePaper | 20220518_10 | 23112 | 10 | 18951 | cern.zenodo | 22894 | 83% | Zenodo (18951), German Medical Science (2839), University of New South Wales (421), TIB Hannover (201), Eurographics (144), Deutscher Verband für Materialforschung und -prüfung e.V. (126), University of Massachusetts (UMass) Amherst (71), pub H-BRS - Publikationsserver der Hochschule Bonn-Rhein-Sieg (50), Aalto University (46), TU Delft Research Repository (45) |
ConferenceProceeding | 20220518_10 | 854 | 10 | 583 | tib.tib | 799 | 73% | TIB Hannover (583), Università degli Studi di Pisa (61), University of Texas Libraries (32), Leibniz Institute of Ecological Urban and Regional Development (29), Université de technologie de Compiègne (24), Instituto Politécnico do Porto (19), Data INRAE (15), UC Santa Barbara (14), Repositório Aberto da Universidade do Porto (11), Washington State University (11) |
Dissertation | 20220518_10 | 62685 | 10 | 22116 | unsw.repo | 61537 | 36% | University of New South Wales (22116), Universitätsbibliothek der LMU (17934), Repositorio Institucional E-docUR (6942), Aston Publications Explorer (4804), STAX (4650), University of Massachusetts (UMass) Amherst (2720), Polska Platforma Medyczna (1156), Washington University in St. Louis Libraries (468), UKnowledge (381), TU Delft Research Repository (366) |
Journal | 20220518_10 | 372 | 10 | 109 | zjrp.depp | 285 | 38% | Direction de l’évaluation; de la prospective et de la performance (109), Royal Botanic Gardens; Kew (92), Lusíada - Repositório das Universidades Lusíada (13), iMex. México Interdisciplinario / Interdisciplinary Mexico (13), Aspekty Muzyki (12), Università del Piemonte orientale “Amedeo Avogadro” (11), Universitätsbibliothek Eichstätt-Ingolstadt (9), James Cook University (9), mdwRepository (9), Université Paul-Valéry Montpellier 3 (8) |
JournalArticle | 20220518_10 | 179636 | 10 | 148742 | cern.zenodo | 174371 | 85% | Zenodo (148742), peDOCS (14496), Stanford Social Innovation Review (5826), Alfred Wegener Institute (3093), innsbruck university press (649), Entrepôt pour orphelin (420), Aalto University (377), NumeRev (354), Akofena (223), unipub (191) |
OutputManagementPlan | 20220518_10 | 961 | 4 | 540 | cern.zenodo | 961 | 56% | Zenodo (540), California Digital Library (418), SOCIB (2), CDI B2Share (1) |
PeerReview | 20220518_10 | 291 | 6 | 182 | inist.humanum | 291 | 63% | Huma-Num (182), NAKALA (100), Polskie Towarzystwo Logopedyczne (6), bonndoc (1), Tampa Repository (1), HEE Journal - The Journal of Health; Environment; and Education (1) |
Preprint | 20220518_10 | 930504 | 10 | 927777 | arxiv.content | 930490 | 100% | arXiv (927777), Zenodo (2608), NAKALA (46), Otto-von-Guericke-Universität Magdeburg (15), University of Massachusetts (UMass) Amherst (15), Oroboros Instruments (10), ISCPSI - Instituto Superior de Ciências Policiais e Segurança Interna (6), Institut national d'études démographiques (5), Materials Data Repository (4), Washington University in St. Louis Libraries (4) |
Report | 20220518_10 | 82283 | 10 | 66681 | tib.tib | 80814 | 83% | TIB Hannover (66681), Zenodo (7895), Alfred Wegener Institute (2568), CL Technical Reports (964), Libra (902), University of New South Wales (587), National Research Council Canada (516), Open Society Foundations (282), RIVM - Rijksinstituut voor Volksgezondheid en Milieu (210), IOC of UNESCO (Intergovernmental Oceanographic Commission) (209) |
Standard | 20220518_10 | 2146 | 6 | 1848 | univie.valep | 2146 | 86% | VALEP (1848), National Research Council Canada (294), Royal Botanic Gardens; Kew (1), NASA Space Physics Data Facility (SPDF) (1), Institut national de recherche pour l’agriculture; l’alimentation et l’environnement (1), Central Lancashire Online Knowledge (1) |
Report created 20220518_10 by retrieveDataCiteFacets from Metadata Game Changers
Table 3. Clients facet (i.e. repositories using resource type) for resource types introduced during the last year.
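The types dominated by a single repository can be picked out of Table 3 programmatically. The sketch below assumes the homogeneity index is clients_max divided by clients_total, so it is relative to the top ten clients rather than to NumberOfRecords; the variable names are hypothetical.

```python
# (clients_max, clients_total) pairs from Table 3.
clients_max_total = {
    "Book": (6433, 14786),
    "BookChapter": (5017, 10339),
    "ComputationalNotebook": (4, 10),
    "ConferencePaper": (18951, 22894),
    "ConferenceProceeding": (583, 799),
    "Dissertation": (22116, 61537),
    "Journal": (109, 285),
    "JournalArticle": (148742, 174371),
    "OutputManagementPlan": (540, 961),
    "PeerReview": (182, 291),
    "Preprint": (927777, 930490),
    "Report": (66681, 80814),
    "Standard": (1848, 2146),
}

# Keep the types where one client holds more than 80% of the listed items.
dominated = sorted(
    name
    for name, (top, total) in clients_max_total.items()
    if top / total > 0.8
)
print(dominated)
```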
Conclusion and Tools
These examples demonstrate several of the multitude of interesting questions that can be addressed using DataCite facet data available with any DataCite query. Together, the DataCite vocabularies for resourceType, relationType, and contributorType include over eighty items. Finding examples of how these items are currently used in DataCite metadata is an important step towards identifying good practices that can be shared across the community as a step towards consistent use and understanding of metadata.
A tool for retrieving these data is available at https://github.com/Metadata-Game-Changers/DataCiteFacets. The documentation there describes how you can retrieve data for any and all facets and for all DataCite resourceTypes, relationTypes, and contributor types as well as for creator affiliation strings. This is the first open source software released by Metadata Game Changers. I hope you find that software helpful in answering your own questions about DataCite metadata usage!