Instruments@DataCite

Ted Habermann, Metadata Game Changers

Version 4.5 of the DataCite Metadata Schema, released during January 2024, includes several changes supporting the identification and description of instruments. These changes were made in concert with the RDA PIDINST Working Group as a step towards implementing their recommended instrument metadata content in DataCite. Several DataCite members were describing instruments in DataCite metadata before this capability was introduced and others are beginning to do it now. These existing efforts can inform the development of community conventions and help the broader community understand how to use instrument metadata effectively. This blog post explores current usage as initial input to that process.

DataCite Schema Changes

The DataCite metadata schema includes a shared vocabulary (resourceTypeGeneral) that defines thirty resource types that can be described in DataCite metadata (Habermann, 2023). Version 4.5 of the DataCite metadata schema added the term “Instrument” to this list, allowing unambiguous identification of instruments and related metadata. 

Several organizations were using DataCite to describe instruments prior to these changes without a standard mechanism for identifying the described resources as instruments. In some cases, they combined resourceTypeGeneral = PhysicalObject or Other with the free-text field resourceType = “Instrument” to identify these records. In others, the resourceType held a longer free-text description of the instrument, which makes these records difficult to discover unambiguously. The new resourceTypeGeneral = Instrument solves this discovery problem (Figure 1).

Figure 1. Two schema changes were made to support instruments: a new resourceTypeGeneral = Instrument and new relationTypes Collects and IsCollectedBy.
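The difference between the older conventions and the new term can be sketched as a small classifier. This is a minimal illustration, assuming records are available as dictionaries with a `types` object shaped like the DataCite JSON (`resourceTypeGeneral`, `resourceType`); the function name and the labels "explicit" and "legacy" are hypothetical.

```python
def classify_types(types):
    """Classify a DataCite `types` object as an instrument record.

    Returns "explicit" for resourceTypeGeneral = Instrument, "legacy" for
    the pre-4.5 convention (PhysicalObject or Other plus the free-text
    resourceType "Instrument"), and None otherwise.
    """
    general = types.get("resourceTypeGeneral") or ""
    specific = (types.get("resourceType") or "").strip().lower()
    if general == "Instrument":
        return "explicit"
    if general in ("PhysicalObject", "Other") and specific == "instrument":
        return "legacy"
    return None
```

Records whose resourceType holds a longer free-text description fall through to `None` here, which is exactly the discovery problem the new term addresses.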

The resourceTypeGeneral vocabulary is also used to define types of resources connected using related identifiers. Adding Instrument to this vocabulary makes it possible to make well-defined connections between instruments, datasets, and other resources. These connections also include relationTypes from the relationType vocabulary. The terms IsCollectedBy and Collects were added to this vocabulary specifically for use with instruments (Figure 1).
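As a rough sketch of how these pieces fit together, the record below describes an instrument that collects a dataset. The attribute names follow the DataCite JSON layout as I understand it; the DOIs, title, and resourceType are placeholders rather than real records, and the helper `inverse_links` is hypothetical.

```python
# Placeholder instrument record (not a real DOI).
instrument_record = {
    "doi": "10.1234/example-instrument",
    "types": {
        "resourceTypeGeneral": "Instrument",  # term added in schema 4.5
        "resourceType": "Example experiment station",  # free text
    },
    "titles": [{"title": "Example Spectrometer"}],
    "relatedIdentifiers": [
        {
            "relatedIdentifier": "10.1234/example-dataset",
            "relatedIdentifierType": "DOI",
            "relationType": "Collects",  # instrument -> dataset
            "resourceTypeGeneral": "Dataset",  # type of the related resource
        }
    ],
}

# The two relationTypes added for instruments are inverses of each other.
INVERSE = {"Collects": "IsCollectedBy", "IsCollectedBy": "Collects"}


def inverse_links(record):
    """Build the reciprocal links that related resources could carry."""
    links = []
    for rel in record.get("relatedIdentifiers", []):
        inverse = INVERSE.get(rel["relationType"])
        if inverse is None:
            continue  # only handle the instrument-specific pair here
        links.append({
            "source": rel["relatedIdentifier"],
            "relatedIdentifier": record["doi"],
            "relatedIdentifierType": "DOI",
            "relationType": inverse,
            "resourceTypeGeneral": record["types"]["resourceTypeGeneral"],
        })
    return links
```

Calling `inverse_links(instrument_record)` yields one entry that the dataset record could add, with relationType IsCollectedBy and resourceTypeGeneral Instrument.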

Early Adopters

Early adopters of resourceTypeGeneral = Instrument currently have 78 records (Table 1). Several of these repositories participated in the early PIDINST discussions and several have joined the effort since DataCite included this resourceTypeGeneral.

Repository ID | Name | Count
tib.hzb | Helmholtz-Zentrum Berlin für Materialien und Energie GmbH | 24
psu.dmr-first | 2D Crystal Consortium (2DCC) - Division of Materials Research (DMR) - FIRST | 22
todn.hcsvci | Technical University of Denmark - Energy Innovation Systems | 12
pawsey.repo | Pawsey Supercomputing Centre | 7
cos.osf | Open Science Framework | 6
pqip.devices | Helmholtz-Zentrum Dresden-Rossendorf e.V. -DEVICES | 3
upenn.repo | Univ. of Pennsylvania Repository | 2
tib.iow | Leibniz-Institut fuer Ostseeforschung Warnemuende | 1
uq.repo | The University of Queensland | 6

Table 1. Early adopters of resourceTypeGeneral = Instrument.
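A tally like Table 1 could be reproduced along the following lines. This sketch assumes the DataCite REST API accepts a `resource-type-id` filter on its `/dois` endpoint and returns JSON:API-style items with a `relationships.client.data.id` field; both should be checked against the current API documentation. The function names are hypothetical, and the counting step works on an already-downloaded response so it can be tried offline.

```python
from collections import Counter
from urllib.parse import urlencode

API_ROOT = "https://api.datacite.org/dois"


def instrument_query_url(page_size=100):
    """Build a query URL for records typed as Instrument (assumed filter)."""
    params = {"resource-type-id": "instrument", "page[size]": page_size}
    return f"{API_ROOT}?{urlencode(params)}"


def count_by_repository(payload):
    """Count records per repository ID in one page of API results."""
    counts = Counter()
    for item in payload.get("data", []):
        client = item.get("relationships", {}).get("client", {}).get("data") or {}
        counts[client.get("id", "unknown")] += 1
    return counts
```

A full tally would need to page through all results before counting.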

Content

The content of the instrument metadata records from these repositories is shown in Figure 2. The required elements (orange) support identification and citation, the primary use cases of the DataCite schema. There are 78 records, so all required elements occur at least 78 times.

Figure 2. Content of 78 Instrument records from sources listed in Table 1. Mandatory elements are orange, funding metadata are light blue, descriptive metadata are green, identifiers are red, and other elements are blue. The concept names in this figure map to 1) specific DataCite elements, e.g., Resource Title; 2) relatedIdentifier.relationTypes, e.g., DocumentedBy; 3) contributorTypes, e.g., Distributor; or 4) descriptionTypes, e.g., Abstract. See Habermann (2024) for the details of this mapping.

Resource Identifiers

Several mandatory fields occur more than 78 times, indicating that some records include multiple values for these elements. One of these is Resource Title, which reflects the inclusion of titles in multiple languages or of acronyms with titleType = AlternativeTitle. Another is Resource Identifier, which reflects the existence of multiple identifiers with different types for some instruments. The most common additional identifier is “serial number”, which occurs 16 times.

Descriptions

The DataCite metadata schema includes several metadata elements, shown as light green in Figure 2, that provide descriptive metadata across a spectrum of detail (Figure 3). In the case of instruments, the identifier (DOI) provides the ability to connect to and from the specific instrument, and resourceTypeGeneral is a very general type (Instrument). ResourceType provides more specific information as free text, which can be used in many ways. The next level of detail is the instrument title, i.e. the name of the specific instrument, with the acronym included as an AlternativeTitle (if available or commonly used). The most detailed information in the metadata record is provided by descriptions with types Abstract or TechnicalInfo. Forty-one (52%) of the existing instrument records include all four of these descriptive elements.

Figure 3. DataCite metadata elements that provide resource descriptions at several levels of detail (top row) and connections to identified resources, contributors, and funders (bottom row) for instruments and other resources.
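A completeness check along this detail spectrum might look like the sketch below. It takes the four descriptive elements to be resourceTypeGeneral, resourceType, title, and a description of type Abstract or TechnicalInfo, which is one reading of the text above rather than the exact method behind the reported counts. Attribute names assume the DataCite JSON layout, and the function names are hypothetical.

```python
DETAILED_TYPES = ("Abstract", "TechnicalInfo")


def descriptive_levels(attrs):
    """Report which levels of the detail spectrum a record fills in."""
    types = attrs.get("types", {})
    return {
        "resourceTypeGeneral": bool(types.get("resourceTypeGeneral")),
        "resourceType": bool(types.get("resourceType")),
        "title": any(t.get("title") for t in attrs.get("titles", [])),
        "description": any(
            d.get("descriptionType") in DETAILED_TYPES and d.get("description")
            for d in attrs.get("descriptions", [])
        ),
    }


def fraction_complete(records):
    """Fraction of records that fill in all four descriptive levels."""
    if not records:
        return 0.0
    complete = sum(all(descriptive_levels(r).values()) for r in records)
    return complete / len(records)
```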

The instrument metadata from these repositories include some elements that are unexpected for physical instruments, e.g. Resource Format with values like “text/csv” and Resource Size with values like “1.3MB”. Exploring these descriptive elements shows that two of the repositories in Table 1 are using the type “Instrument” to identify things like surveys, questionnaires, interviews, tests, checklists, or observation forms. This usage is consistent with the DataCite definition of Instrument, but more detail is required to clearly identify the semantics of Instrument. It is interesting to note that “StudyRegistration” was added to the resourceTypeGeneral list in Version 4.5 and that “Project” will be added in the next version. These additions may provide a more appropriate solution for four of the records in Table 2, which currently use these terms as resourceTypes.

resourceType | Title
Project | BENDEP-SRQ-GV
ProjectComponent | RebeL - Codebook/scales manual, dataset, summary chart of RebeL [Skalenhandbuch, Datensatz, Überblicksgrafik]
ProjectComponent | Data collection form
Project | Methodology to analyse the divergent thinking egg task
Pre-registration | Digital Interface Patterning for Detecting Illegitimate Publications (DIP-DIP) scale
Pre-registration | Mental Health & Illness Education in Paramedicine: A Scoping Review
template | Data Dictionary Blank Template
Assessment Check-list | FAIR Assessment Checklist for Data Repositories

Table 2. resourceTypes and titles from records describing research instruments rather than physical instruments.

Keywords are another important element for describing instruments and have the advantage that they can be selected from specialized instrument keyword vocabularies or ontologies. The current metadata include many keywords (more than one per record on average), but keywords drawn from identified vocabularies are still rare. The most common vocabulary in these metadata is the Bepress Digital Commons Three-Tiered Taxonomy, now part of Elsevier Digital Commons.

Connections

Connections to and from instrument metadata are critical for integrating instruments into the broader research infrastructure and for providing context for understanding the instruments and, equally important, how they are used. The DataCite metadata schema provides several ways to connect instruments to people, organizations, funders, and other resources, also shown in Figure 3.

Funder References

The introduction of the FunderReference element in Version 4.0 of the DataCite schema in 2016 expanded funder description capabilities to include funderName, funderIdentifier, awardNumber, awardURI, and awardTitle. Funder metadata are still rare in these instrument records (Figure 2), with Project Funder and Award Number occurring in 28 records (36%), Award Title occurring in 23 records (29%), and Funder Identifiers and Award URIs each occurring in fewer than 10% of the records. In some cases, the same funder is listed under several different names or acronyms and without funder identifiers. This makes it difficult to recognize these funders unambiguously and to ensure consistent acknowledgement.
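The ambiguity that arises when funders are named without identifiers can be illustrated with a short sketch. It assumes each record carries a `fundingReferences` list with `funderName` and `funderIdentifier` keys, as in the DataCite JSON; the funder names and identifier in the usage note are invented placeholders, and `group_funder_names` is a hypothetical helper.

```python
from collections import defaultdict


def group_funder_names(records):
    """Group funder names by funderIdentifier.

    Names that come without an identifier are collected under None,
    where variants of the same funder cannot be told apart from
    genuinely different funders.
    """
    names = defaultdict(set)
    for attrs in records:
        for ref in attrs.get("fundingReferences", []):
            key = ref.get("funderIdentifier") or None
            names[key].add(ref.get("funderName", ""))
    return dict(names)
```

For example, records naming “Example Research Ministry” with identifier `https://doi.org/10.13039/000000000000`, “ERM”, and “Example Ministry of Research” would produce one identified group plus two unidentified names under `None` that a tool cannot safely merge.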

People and Organizations

DataCite creators are important because they are listed in recommended citations of DataCite resources. Resource Author Identifiers occur for most creators in these metadata. The DataCite schema includes nameTypes of either “Personal” or “Organizational”, which allows identification of organizations that are responsible for these instruments. Most of the creators are organizations, identified by RORs, while individuals are identified by ORCIDs. Resource Author Affiliation Identifiers (RORs) occur in only five records. RORs are more common as identifiers for contributors with contributorType = “HostingInstitution”, which occurs in 65 records.

Helmholtz-Zentrum Berlin für Materialien und Energie GmbH

The Helmholtz-Zentrum Berlin für Materialien und Energie GmbH (HZB) helped lead the RDA effort to develop instrument identifiers and to apply these identifiers to support FAIR data systems. HZB has currently documented twenty-four instruments in DataCite metadata. Together these provide real-world examples of many of the documentation concepts described here and shown in Figure 3. 

Figure 4 shows HZB instruments from the BESSY Synchrotron Light Source, identified and connected using the identifiers from the metadata. The orange node in the center of the graph is the organization HZB, identified by a ROR (https://ror.org/02aj13c28). It is connected to nine instruments (blue nodes identified by DOIs) as an author, necessary for HZB to appear in the citation, and as a contributor with contributorType = HostingInstitution.

Instruments that have been used together in particular experiments are collected into three groups by relatedIdentifiers with relationType = References. One instrument, with DOI suffix “ni000022”, appears without any connections. Three of these instruments, in two experiments, were funded by the organization with Crossref Funder ID = 501100002347 (red). Finally, instruments are linked to articles in journals (green) or proceedings (pink) using relatedIdentifiers with relationType = “IsDescribedBy”.

Figure 4. Connected instruments, organizations and publications for the BESSY II light source at HZB. Identifiers are used to label nodes and show just suffixes for display simplicity.

Figure 4 is useful for seeing relationships and groups, and the identifiers alone make it possible for tools to identify items and relationships unambiguously. Identifiers are less useful for human users who are not intimately familiar with this research environment and the HZB identifiers.
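A graph like Figure 4 can be assembled from identifiers alone. The sketch below is one possible approach, not the method used to produce the figure: it turns a single record into (source, relation, target) edges, assuming DataCite JSON attribute names. The instrument and article DOIs in the usage note are placeholders; the ROR is the HZB identifier quoted above.

```python
def edges_from_record(attrs):
    """Derive (source, relation, target) edges from one instrument record."""
    doi = attrs["doi"]
    edges = []
    # Creators with identifiers (e.g. an organization's ROR).
    for creator in attrs.get("creators", []):
        for ident in creator.get("nameIdentifiers", []):
            edges.append((ident["nameIdentifier"], "Creator", doi))
    # Contributors, labeled by contributorType (e.g. HostingInstitution).
    for contributor in attrs.get("contributors", []):
        for ident in contributor.get("nameIdentifiers", []):
            edges.append((ident["nameIdentifier"], contributor.get("contributorType"), doi))
    # Links to other instruments, articles, and datasets.
    for rel in attrs.get("relatedIdentifiers", []):
        edges.append((doi, rel["relationType"], rel["relatedIdentifier"]))
    # Funders, only when an identifier is present.
    for ref in attrs.get("fundingReferences", []):
        if ref.get("funderIdentifier"):
            edges.append((ref["funderIdentifier"], "Funder", doi))
    return edges
```

A record with the HZB ROR as both creator and HostingInstitution, plus one IsDescribedBy link to an article, yields three edges; an instrument with no identifiers beyond its own DOI yields none and would appear as an isolated node.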

Figure 5 takes a step from identifiers towards more detail (Figure 3) by labeling instruments with resourceTypes from the metadata. HZB uses resourceTypes to record the light source and associated experiment station for each instrument, which is needed because only nine of their twenty-four instruments are associated with BESSY II. The other HZB instruments, not shown here, are associated with the BER II reactor. Each group of connected instruments represents a different configuration of the experiments.

Figure 5. Connected instruments, organizations and publications for the BESSY beamline at HZB. In this case, identifiers are replaced by resourceTypes for instruments (blue). HZB uses resourceTypes to identify light sources and associated experiment stations for the instruments, BESSY II in this case.

Figure 6 takes one more step along the detail spectrum by using titles to label the instrument nodes. These are specific to each instrument and, unlike the identifiers, they are human readable. Organization names and article titles are also human readable, so, while Figure 4 provides a machine-readable picture of the BESSY beamline, Figure 6 provides a complementary human-readable version of the picture.

Figure 6. Connected instruments, organizations and publications for the BESSY beamline at HZB. In this case, titles/names are used for organizations, articles and instruments, allowing specific instruments and other resources to be identified by human users.
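The three labeling choices in Figures 4 to 6 amount to picking a different metadata element for the same node. A minimal sketch, assuming DataCite JSON attribute names and a hypothetical `node_label` helper that falls back to the DOI suffix when the requested element is missing:

```python
def node_label(attrs, level):
    """Label an instrument node at a chosen level of detail.

    level is "identifier" (DOI suffix), "resourceType", or "title".
    """
    suffix = attrs["doi"].split("/", 1)[-1]
    if level == "identifier":
        return suffix
    if level == "resourceType":
        return attrs.get("types", {}).get("resourceType") or suffix
    if level == "title":
        titles = attrs.get("titles", [])
        return titles[0]["title"] if titles and titles[0].get("title") else suffix
    raise ValueError(f"unknown level: {level}")
```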

Conclusion

Several DataCite members are beginning to use existing DataCite metadata elements to take advantage of the resourceTypeGeneral=Instrument capability introduced during January 2024. Currently 78 instrument records have been created by nine organizations.

Like many metadata records in DataCite, these records focus on the identification and citation use cases: they populate the mandatory DataCite fields and make less use of fields that support other use cases, such as connecting people, organizations, funders, and other research objects. Re-curation of these existing records can increase metadata completeness along with the return on investment of metadata creation and maintenance.

Existing metadata elements support instrument descriptions at several different levels along a detail spectrum: resourceTypeGeneral -> resourceType -> title -> description -> relatedIdentifiers. Organizations that are just starting along the path to instrument metadata in DataCite can take advantage of examples from the Helmholtz-Zentrum Berlin für Materialien und Energie GmbH that demonstrate this spectrum quite well. These examples may also provide a helpful starting point for dialogs on community conventions for instrument metadata in DataCite.

Join a Community Dialog!

DataCite, in partnership with Metadata Game Changers, is excited to announce a virtual community dialogue session to engage the broader life science and astronomy communities in co-developing metadata enhancements supporting the development and use of persistent identifiers (PIDs) for instruments. This dialogue represents the first phase of a larger project recently launched by DataCite. 

We invite metadata creators and users to join us and actively advance infrastructure solutions to identify, describe, discover, and track the impact of instruments across domain communities. A primary focus of the dialogue and the broader initiative, generously funded by the Richard Lounsbery Foundation, is community collaboration on co-developed prototype solutions tailored to the needs of the life sciences and astronomy communities.

To this end, we would like to invite you to register today for this virtual dialogue. 

🔬Persistent Identifiers for Instruments Community Dialogue:

Information:

References

Günther, G., Bär, M., Greve, N., Krahl, R., Kubin, M., Mannix, O., Smith, W., Vadilonga, S., & Wilks, R. (2022). FAIR Meets EMIL: Principles in Practice. Proceedings of the 18th International Conference on Accelerator and Large Experimental Physics Control Systems, ICALEPCS2021, China. https://doi.org/10.18429/JACOW-ICALEPCS2021-WEBL05

Habermann, T. (2023). How Many When (Update). Front Matter. https://doi.org/10.59350/w1em6-nn888 

Habermann, T. (2024). FAIR Metadata Concepts in DataCite Metadata Schema [Data set]. Zenodo. https://doi.org/10.5281/zenodo.12168626

RDA Persistent Identification of Instruments WG, https://www.rd-alliance.org/groups/persistent-identification-instruments-wg/