DataCite Project Use Cases
Ted Habermann, Metadata Game Changers
Repository managers have been using DataCite metadata to describe projects since at least 2017. When I took a look at how they were doing this during July 2024, finding projects was a bit tricky because it could only be done by searching the free-text resourceType element for “project”. It turned out that resourceTypes that ended with “*roject” seemed to be a reasonable sample and that was the sample used in that work.
It has now been a year since DataCite added “project” to the resourceTypeGeneral vocabulary, making projects a first class citizen in DataCite metadata and making it possible to unambiguously identify them in the API by adding resource-type-id=project to the query. Many projects created before this addition have “Project” in the free text resourceType element and either “Other” or “Text” in resourceTypeGeneral. These are also retrieved using this API query, allowing projects to be found before repositories update their metadata to reflect the new resourceTypeGeneral.
Repositories already using DataCite for projects make up a corpus that can be used to help understand project use cases and how metadata are being used to address them. In this blog we address three use cases that came up in community discussions: Project Teams, Project Items, and Project Relations. We first develop an understanding of how these use cases are currently being addressed. We then introduce a tool for exploring how individual repositories are using DataCite to describe projects and for tracking the progress your repository is making with this new capability.
Repositories With Projects
The first step in the analysis is to find the repositories that are already taking advantage of the project resource type in DataCite metadata. The API query https://api.datacite.org/dois?resource-type-id=project identifies them and returns just over 100,000 records from sixty-eight different repositories. The repositories with more than ten project records are listed in Table 1. Forty-nine other repositories currently have fewer than ten project records. The Open Science Framework at the Center For Open Science has the most project records (104,976) while sixty-seven other repositories have a total of 1,187 project records.
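As a minimal sketch of this first step, the query URL can be built and the returned records tallied by repository. The helper names (project_query_url, count_by_repository) are hypothetical, and the sketch assumes the API returns JSON:API records in which the repository is found at relationships.client.data.id. It builds the URL only and leaves the HTTP request to the reader.

```python
from collections import Counter
from urllib.parse import urlencode

API_ROOT = "https://api.datacite.org/dois"


def project_query_url(page_size=100, **extra):
    """Build the DataCite API URL for records with the project resource type."""
    params = {"resource-type-id": "project", "page[size]": page_size}
    params.update(extra)
    return f"{API_ROOT}?{urlencode(params)}"


def count_by_repository(records):
    """Tally records per repository.

    Assumption: each record carries its repository identifier at
    record["relationships"]["client"]["data"]["id"].
    """
    counts = Counter()
    for record in records:
        client = record.get("relationships", {}).get("client", {}).get("data") or {}
        counts[client.get("id", "unknown")] += 1
    return counts
```

Passing the "data" list from each page of the response to count_by_repository gives a count per repository that can be compared with Table 1.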
Table 1. Repositories with more than 10 project metadata records and count of projects.
Project Team Use Case
The DataCite metadata schema includes people/organizations as creators and contributors. Creator is a required element, so it occurs in all records. Contributors can have one of twenty-three roles. Several of these, projectLeader, projectManager, and projectMember, are specific to projects. Others can be used for projects and other resource types.
Table 2 shows the contributor roles that occur in DataCite projects. The two most common roles are related to infrastructure and DOIs instead of to people and organizations. The most common role, HostingInstitution, is the “organisation allowing the resource to be available on the internet”. Most of the HostingInstitutions in these records are for projects from the Center for Open Science.
Table 2. Contributor roles in DataCite projects.
A RegistrationAgency is an institution/organisation officially appointed by a Registration Authority to handle specific tasks within a defined area of responsibility. DataCite is a Registration Agency for the International DOI Foundation (IDF) and, as such, is responsible for minting DOIs for the vast majority of DataCite repositories. The most common registration agency after DataCite is the Smithsonian Institution, referenced from the Chandra (X-ray Observatory) Data Archive. The Registration Agency in these project metadata records is the Australian Research Data Commons (ARDC) that currently mints Research Activity IDs (RAiDs) for several Australian organizations. Many RAiD metadata elements are compatible with the DataCite schema and they are in metadata registered as a DataCite DOI through the ARDC/DataCite partnership.
Seven hundred and forty-six of the contributor types in these metadata are related to people or organizations in one of eleven roles, with DataManager, Researcher, RelatedPerson, Sponsor, and ProjectMember each occurring over 100 times. Three hundred and fifty-six contributors have the type “Other”, indicating contributions not covered by the current vocabulary.
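Role counts like these could be produced with a short tally over the project records. This is a sketch rather than the code used for Table 2: the function name is hypothetical and it assumes each record holds its contributors in attributes.contributors, each with a contributorType.

```python
from collections import Counter


def contributor_role_counts(records):
    """Count contributorType values across a list of DataCite JSON records."""
    counts = Counter()
    for record in records:
        attributes = record.get("attributes", {})
        for contributor in attributes.get("contributors") or []:
            # A missing or empty type is reported as "None".
            counts[contributor.get("contributorType") or "None"] += 1
    return counts
```

Calling contributor_role_counts(records).most_common() lists the roles from most to least frequent.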
Related Research Objects Use Cases
Connecting related research outputs and other kinds of objects is another important use case for project metadata. The DataCite metadata schema identifies related objects using relatedIdentifiers for objects with identifiers and relatedItems for objects that do not have identifiers. More detail is provided using one of thirty-eight relationTypes and the resource type of the related item is indicated using the DataCite resourceTypeGeneral vocabulary.
Nineteen of sixty-seven (28%) repositories with projects include some related identifiers in their metadata. Table 3 shows the relations and resourceTypes used. The types are optional and “None” indicates they are missing.
Table 3. Related identifier relations and types for projects in 18 repositories.
The DataCite relation types are generally “two-sided”, describing relations between the source resource (A) and the related resource (B). The most common existing relation type for projects, HasPart, means that the project A includes part B. It has been used to identify many different types of objects that are parts of projects. The most common is Physical Objects with over 500 included in a single project. These are samples collected as part of the ARMS for Biodiversity Baselines in Polynesie Francaise project at the Tetiaroa Society Ecostation. Related identifiers in DataCite can be one of many types. These physicalObjects are identified by Archival Resource Keys (ARKs).
The reciprocal relation, IsPartOf, is used with the resource type Project to define a hierarchy of projects and sub-projects, i.e. project A is part of project B. This relation type occurs 275 times. In 99 cases projects are shown to be part of journal articles. It is not clear what this relation means in practice.
The versioning relation types, HasVersion and IsVersionOf, are also quite common, but do not include types for the related items. In all, eight relation types are used to describe connections between multiple projects.
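The combinations of relation type and resource type discussed above can be tallied in the same way. The sketch below is illustrative only: the function name is hypothetical, it assumes related identifiers are found in attributes.relatedIdentifiers with relationType and resourceTypeGeneral keys, and it does not look at relatedItems.

```python
from collections import Counter


def relation_type_pairs(records):
    """Count (relationType, resourceTypeGeneral) pairs in relatedIdentifiers."""
    counts = Counter()
    for record in records:
        related_ids = record.get("attributes", {}).get("relatedIdentifiers") or []
        for related in related_ids:
            relation = related.get("relationType") or "None"
            # resourceTypeGeneral is optional; "None" marks a missing type.
            resource_type = related.get("resourceTypeGeneral") or "None"
            counts[(relation, resource_type)] += 1
    return counts
```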
The counts in Table 1 show that The Open Science Framework (OSF) has far more project records than the rest of DataCite combined, so overall counts of metadata element values reflect behavior of OSF rather than general behavior across multiple repositories. The number of repositories that are using these relations also provides insights. Table 4 shows the number of repositories using each relation type.
Table 4. Number of repositories using relation types in project metadata.
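Counting repositories rather than records keeps one large repository from dominating the result. A minimal sketch of that count follows. The function name is hypothetical, and it assumes the repository identifier is at relationships.client.data.id and the relations are in attributes.relatedIdentifiers.

```python
from collections import defaultdict


def repositories_per_relation(records):
    """Return the number of distinct repositories using each relationType."""
    repositories = defaultdict(set)
    for record in records:
        client = record.get("relationships", {}).get("client", {}).get("data") or {}
        client_id = client.get("id", "unknown")
        related_ids = record.get("attributes", {}).get("relatedIdentifiers") or []
        for related in related_ids:
            relation = related.get("relationType")
            if relation:
                # A set counts each repository once, however many records it has.
                repositories[relation].add(client_id)
    return {relation: len(ids) for relation, ids in repositories.items()}
```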
Metadata Game Changers Project Use Cases
The DataCite metadata schema includes over sixty elements that can be used to address many metadata use cases beyond the identify and cite use case DataCite was initially designed for. Four of these use cases: FAIR Text, FAIR Identifiers, FAIR Connections, and FAIR Contacts, have been used to provide insights into metadata completeness for over 3,000 DataCite repositories and to identify repositories with exceptional metadata for these use cases (bright spots). This approach serves the goal of improving metadata FAIRness as defined by these use cases.
Using DataCite metadata for projects is still relatively new and the community is still exploring how to do it effectively. The differences in metadata content illustrated above reflect this experimentation. The approach used for quantitatively characterizing metadata FAIRness can be used to understand how repositories are using project metadata.
The first step in this process is to define the metadata elements that might be used to address particular use cases. The data described above provides an empirical foundation for three use cases:
· Project Teams includes creators, contributors, and funders with identifiers and specific contributor types that are observed in existing project metadata (Table 2). Temporal coverage is also included here.
· Project Items includes types of resources that have been connected to existing projects (Table 3).
· Project Relations are the relation types that have been used to connect items to existing projects (Table 3).
The specific documentation concepts in each use case are listed in Table 5. These concepts exist in many metadata dialects and are mapped to specific metadata elements for analysis.
Table 5. Documentation concepts included in three project use cases. These concepts all map to DataCite metadata elements for analysis.
These use cases differ from the FAIR Use Cases because of differences between projects across multiple organizations. For example, Metadata Game Changers has thirteen projects and 85% of those have related Text items while only 15% have related Journal Articles. This reflects the fact that we generally focus on blogs to describe and share our project work rather than journal articles. It is interesting to note that our projects are the only ones across all of DataCite that include related ComputationalNotebooks, another way that we share capabilities with our community.
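Percentages like these can be read as the share of a repository's project records that contain at least one related item of a given type. The sketch below shows one way to compute that share. It is an assumption about the calculation rather than the code behind the figures: the function name is hypothetical, and only attributes.relatedIdentifiers is examined.

```python
def percent_with_related_type(records, resource_type):
    """Percentage of records with at least one related identifier of the given
    resourceTypeGeneral (for example "Text" or "JournalArticle")."""
    if not records:
        return 0.0
    hits = 0
    for record in records:
        related_ids = record.get("attributes", {}).get("relatedIdentifiers") or []
        if any(r.get("resourceTypeGeneral") == resource_type for r in related_ids):
            hits += 1
    return 100.0 * hits / len(records)
```

The same pattern, applied to each documentation concept in a use case, gives one completeness value per concept for a repository.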
Keeping this in mind, repositories with high numbers in these use cases are good examples of the project description capabilities that the DataCite metadata schema offers. It is not surprising, therefore, that the highest utilization of the schema comes from a project record developed by DataCite for the FAIR Workflows Project.
Figure 1 shows the Project Team use case for the thirteen Metadata Game Changers projects. Most of the right side of the plot shows people or organizations as Authors or Contributors to our projects. It indicates that 100% of our projects have team members with identifiers (ORCIDs), affiliations, and affiliation identifiers (RORs). Near the bottom of the figure we can see that all of our projects have funders identified, also with identifiers. The rest of the figure shows team member roles identified by contributor types. We have contacts (on the right), project leaders, members, researchers, and sponsors. Finally, we have described the temporal extents of many of our projects using the “Coverage” date type.
Figure 1. The Project Team use case for the Metadata Game Changers Projects demonstrates the capabilities of the DataCite metadata Schema for describing the people and organizations that are connected to projects.
Figure 2 shows the Project Items use case for the Metadata Game Changers projects. As mentioned earlier, most of our project outputs (85%) are shared as blog posts with Text resource types, although there are also a few journal articles and audiovisuals, i.e. recordings of talks. We also have recently developed and shared computationalNotebooks (see below) for a few projects.
Figure 2. The Project Items use case for the Metadata Game Changers Projects demonstrates the capabilities of the DataCite metadata Schema for tracking the kinds of items that are connected to projects.
Figure 3 shows the Project Relations use case for the Metadata Game Changers projects. Most of our related items have very complete metadata, shown in the upper left part of the plot, i.e. identifiers with types, resource item types, and relation types. We currently use the simple approach of HasPart for most of the relations, although one of our projects, an ORCID project with NCAR, has multiple versions.
Figure 3. The Project Relations use case for the Metadata Game Changers Projects demonstrates the capabilities of the DataCite metadata Schema for tracking the relation types for items that are connected to projects.
Your Project Metadata
These Figures show many current capabilities for describing projects using the DataCite Metadata Schema and the status of the Metadata Game Changers project metadata with respect to those capabilities. If you are a DataCite member thinking about using DataCite metadata to describe your projects, the use cases show over 70 existing metadata elements that can help you achieve those goals.
If you are one of the 68 DataCite repositories that already have project metadata, you can use this notebook to explore your existing projects to identify successes and to snapshot the current state before making improvements. Once those improvements are made, comparing subsequent plots helps demonstrate your utilization of existing capabilities and increased return on your DataCite investment.
This approach to exploring project metadata is still at an early stage and we hope you find it helpful. Suggestions for improvements and questions about applying this approach as an aid to developing project metadata are welcome.
