Projects@DataCite Revisited
Cite this blog as Habermann, T. (2024). Projects@DataCite Revisited. Front Matter. https://doi.org/10.59350/ejd15-gvd72.
Just over a year ago I explored the state of projects in DataCite metadata, motivated by a DataCite Metadata Working Group discussion about a new resourceTypeGeneral of “Project” (Habermann, 2023). The landscape has evolved during the last year with the publication of the RAiD Metadata Schema (RAiD, 2023), the emergence of a strategic partnership between DataCite and the Australian Research Data Commons (DataCite and ARDC, 2024), and work on project metadata for the FAIR Island Project (Robinson et al., 2023; Stathis and Robinson, 2024) and several Metadata Game Changers projects. These developments and experiences suggested it was time to revisit projects in DataCite with the goals of understanding how the new resourceTypeGeneral might be used and formulating a baseline to measure future developments.
Finding Projects
The initial step is finding projects in DataCite. The first time around I used the most straightforward approach I could imagine: search for “Project” in the free text resourceType element. The results were mixed. Many of the resources had resourceType = Project, which was good, but many others had descriptions that suggested that they were other resourceTypes. For example, Zenodo had almost 29,000 records with resourceType = Project Deliverable or Project milestone.
This time I used a more focused approach. First, Kelly Stathis at DataCite helped me get beyond the DataCite facet list, which is limited to ten results, to find ninety-six repositories that included “Project” or “project” in their resourceType elements. Examining the resourceType data for these, I found that the resourceType descriptions for most of the resources that appeared to be projects ended with the word Project (or project), so I used URLs like https://api.datacite.org/dois?query=types.resourceType:(*roject) to focus the sample. Re-examining the results identified two repositories that had resourceTypes like “Temperature profiles from XBT probes collected during 20230618 cruise in the framework of the MACMAP project”, which were clearly datasets rather than projects. This winnowing resulted in the set I used for further work: nearly 12,000 records from forty-two repositories.
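The query and the winnowing step can be sketched in a few lines of Python. This is only an illustration, not the workflow actually used for this post: the helper names (build_query_url, winnow), the flat record layout with "repository" and "resourceType" keys, and the repository id example.cruises are all assumptions made for the example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

DATACITE_DOIS = "https://api.datacite.org/dois"


def build_query_url(pattern="*roject"):
    """Build a DataCite API URL that searches the free-text resourceType."""
    query = f"types.resourceType:({pattern})"
    # urlencode percent-encodes ':', '(', ')' and '*'; the server decodes them.
    return f"{DATACITE_DOIS}?{urlencode({'query': query})}"


def winnow(records, excluded_repositories):
    """Drop records from repositories judged, on inspection, to hold non-projects."""
    return [r for r in records if r["repository"] not in excluded_repositories]


url = build_query_url()

# Hypothetical, simplified records: the wildcard matches both of them.
sample = [
    {"repository": "cos.osf", "resourceType": "Project"},
    {
        "repository": "example.cruises",  # hypothetical repository id
        "resourceType": "Temperature profiles from XBT probes collected during "
                        "20230618 cruise in the framework of the MACMAP project",
    },
]
kept = winnow(sample, excluded_repositories={"example.cruises"})
```

Note that the exclusion here is by repository rather than by pattern: a resourceType that merely ends in the word “project” still matches the wildcard, so the final decision about what counts as a project stays a manual one.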
This process illustrates the kinds of challenges faced whenever free text metadata fields are used in discovery and analysis, and it is the primary motivation for using a standard shared list for resourceTypeGeneral in DataCite. This problem will go away when “Project” is added to the shared list for resourceTypeGeneral in the next release of the DataCite schema.
Updated Project Record Sample for Analysis
These selection shenanigans had some effect on the details of the dataset when compared to last year’s data, e.g. the second largest repository in last year’s sample (Zenodo) was dropped, and some other numbers were reduced, but the big picture looks similar. The Center for Open Science - Open Science Framework repository, cos.osf, has by far the largest number of project metadata records and the cos.osf records have resourceTypeGeneral = “Project” so they are also unambiguously projects. Some non-projects may remain in other repositories.
For COS, I randomly sampled 10,000 of 78,000+ records. For all other repositories, I included all project records for the analysis. Six repositories have 100 or more projects, and fifteen repositories have more than 10 projects (Table 1).
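| Repository | Records | Repository | Records |
| --- | --- | --- | --- |
| cos.osf | 10,000 | vcu.vculibraries | 41 |
| usc.dl | 568 | | 36 |
| umich.library | 263 | | 32 |
| gdcc.odum-library | 251 | | 20 |
| unlv.ds | 170 | | 16 |
| cdl.cdl (fairIsland) | 100 | | 15 |
| gasu.repo | 84 | | 11 |
| cngb.cga | 60 | | |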
Table 1. Repositories with more than ten project metadata records.
Documentation Concepts
As described above, there are forty-two repositories that may be exploring project metadata in DataCite. The principal question to be addressed here is: what is the content of these project records? We can address this question using documentation concepts: dialect-independent concept names that can be mapped to many metadata dialects and enable comparison of metadata content across dialects. As an example, consider that DataCite metadata, a collection of documentation concepts, is available in multiple dialects: XML, JSON, schema.org, and JSON-LD. All of these dialects include the title of the resource, a documentation concept, but those titles are in different places in each dialect structure. To check whether the concept “Resource Title” exists in a record, a map of the concepts must exist for that dialect. See Habermann, 2024a for a mapping of the concepts used here to the DataCite Metadata JSON Schema.
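To make the idea of a concept map concrete, here is a minimal Python sketch for the JSON dialect. The paths follow the field names of DataCite JSON as I understand them, but the map itself (CONCEPT_PATHS), the "*" wildcard convention, and the helper names are assumptions for illustration; they are not the mapping published in Habermann, 2024a.

```python
# Hypothetical concept map: concept name -> path into a DataCite JSON record.
# "*" means "every item of the list at this point".
CONCEPT_PATHS = {
    "Resource Title": ["titles", "*", "title"],
    "Resource Author Identifier": ["creators", "*", "nameIdentifiers", "*", "nameIdentifier"],
    "Funder Identifier": ["fundingReferences", "*", "funderIdentifier"],
    "Keywords": ["subjects", "*", "subject"],
}


def values_at(node, path):
    """Collect every non-empty value reached by following path from node."""
    if not path:
        return [node] if node not in (None, "", [], {}) else []
    key, rest = path[0], path[1:]
    if key == "*":
        if not isinstance(node, list):
            return []
        return [value for item in node for value in values_at(item, rest)]
    if isinstance(node, dict) and key in node:
        return values_at(node[key], rest)
    return []


def concepts_in(record, concept_paths=CONCEPT_PATHS):
    """Return the set of concept names that have at least one value in record."""
    return {name for name, path in concept_paths.items() if values_at(record, path)}


# Made-up record: has a title and a keyword, but no author or funder identifiers.
record = {
    "titles": [{"title": "An example project"}],
    "creators": [{"name": "Doe, Jane", "nameIdentifiers": []}],
    "subjects": [{"subject": "metadata"}],
}
found = concepts_in(record)
```

Supporting another dialect would mean writing a second map with the same concept names and different paths, which is what makes comparison across dialects possible.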
Figure 1 shows the number of documentation concepts (bars) and the number of records (bar labels) in each of these repositories. There is no clear relationship between the number of project records in a repository and the number of concepts included in the metadata. Two repositories share the largest number of concepts (30): the Center for Open Science:Open Science Framework (cos.osf) and the DataCite:FAIR Workflows (datacite.eqhrgl). These two repositories have the largest and one of the smallest numbers of project records of all repositories in this study.
DataCite and Metadata Game Changers have been working on project metadata in three repositories to test project metadata concepts (see this Commons page for ideas). In addition to a single record for the Implementing FAIR Workflows Project, which has 30 concepts, two other repositories are exploring the functionality of DataCite project metadata and include more than 24 concepts. One is the FAIR Island repository described by Stathis and Robinson, 2024 (cdl.fairIsland, made up of FAIR Island projects from the cdl.cdl repository) and the other is the new repository of Metadata Game Changers projects (sjyq.oozvia).
Figure 2 shows all documentation concepts found in the project metadata dataset and the number of repositories that contain each. At the top of the plot are the mandatory concepts and the Resource Type that occur in all 42 repositories. Identifiers for authors, affiliations, awards and funders are shown as yellow. Connection concepts, e.g. HasPart or DocumentedBy, are shown in green. Most recommended or optional elements occur in fewer than half (21) of the repositories.
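The two tallies behind Figures 1 and 2 (concepts per repository, and repositories per concept) are simple to compute once each record has been reduced to a set of concept names. The sketch below assumes that reduction has already happened; the function name, the input layout of (repository, concepts) pairs, and the sample data are hypothetical.

```python
from collections import Counter


def tally_concepts(concepts_by_record):
    """Summarize (repository_id, set_of_concepts) pairs.

    Returns two mappings:
      - concept -> number of repositories in which it occurs at least once
      - repository -> number of distinct concepts found in its records
    """
    by_repository = {}
    for repository, concepts in concepts_by_record:
        by_repository.setdefault(repository, set()).update(concepts)

    repositories_per_concept = Counter()
    for concepts in by_repository.values():
        repositories_per_concept.update(concepts)

    concepts_per_repository = {repo: len(c) for repo, c in by_repository.items()}
    return repositories_per_concept, concepts_per_repository


# Made-up data: two records from repo.a, one from repo.b.
sample = [
    ("repo.a", {"Resource Title", "Resource Type", "HasPart"}),
    ("repo.a", {"Resource Title", "Keywords"}),
    ("repo.b", {"Resource Title", "Resource Type"}),
]
repositories_per_concept, concepts_per_repository = tally_concepts(sample)
```

A concept counts once per repository no matter how many records contain it, so a large repository does not dominate the repository-level picture.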
Connections
One of the goals of project metadata is to provide connections between people and organizations that participate in the projects, funders that support them, and resources they create or build on. These connections are implemented in DataCite metadata using persistent identifiers (PIDs) and related identifiers. Figure 3 shows a graph of nine projects included in the Metadata Game Changers portfolio from the last several years which connect 24 people (green), 18 organizations, and almost 60 resources using 9 different roles or relationTypes. This graph was created from project metadata records available in the DataCite Commons.
Figure 3 shows connections between people, organizations, and funders through identifiers:
People (Resource Author Identifier, ORCID, green),
Organizations (Resource Author Affiliation Identifier, ROR, orange),
Funders (Funder Identifier, Crossref Funder ID / ROR, orange)
These identifiers all appear to varying degrees in the DataCite project metadata sample (20, 15, 5, and 6 repositories, green in Figure 2). DataCite metadata can also connect to related objects using relatedIdentifiers, identified here by documentation concepts that correspond to their relation types, e.g. the documentation concept HasPart corresponds to connections with relationType = “HasPart” in the metadata. The most common relation type, HasPart, occurred in six repositories (orange in Figure 2) while others like References were observed in fewer than five repositories.
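A graph like the one in Figure 3 starts from a list of edges pulled out of each record. The following Python sketch shows one way that extraction might look for the attributes of a DataCite JSON record. It is not the code used to draw Figure 3: the function name, the role labels "Creator", "Affiliation" and "Funder", and every identifier in the example are hypothetical placeholders.

```python
def connection_edges(doi, attributes):
    """Return (source, role, target) triples for one DataCite record."""
    edges = []
    for creator in attributes.get("creators", []):
        # People: identifiers such as ORCID iDs attached to creators.
        for ident in creator.get("nameIdentifiers", []):
            if ident.get("nameIdentifier"):
                edges.append((doi, "Creator", ident["nameIdentifier"]))
        # Organizations: affiliation identifiers such as ROR ids.
        for affiliation in creator.get("affiliation", []):
            if isinstance(affiliation, dict) and affiliation.get("affiliationIdentifier"):
                edges.append((doi, "Affiliation", affiliation["affiliationIdentifier"]))
    # Funders: identifiers in funding references.
    for funding in attributes.get("fundingReferences", []):
        if funding.get("funderIdentifier"):
            edges.append((doi, "Funder", funding["funderIdentifier"]))
    # Related resources: the relationType becomes the edge label.
    for related in attributes.get("relatedIdentifiers", []):
        if related.get("relatedIdentifier"):
            label = related.get("relationType", "Unknown")
            edges.append((doi, label, related["relatedIdentifier"]))
    return edges


# Made-up record with placeholder identifiers.
attributes = {
    "creators": [{
        "name": "Doe, Jane",
        "nameIdentifiers": [{"nameIdentifier": "https://orcid.org/0000-0000-0000-0000"}],
        "affiliation": [{"name": "Example University",
                         "affiliationIdentifier": "https://ror.org/00example0"}],
    }],
    "fundingReferences": [{"funderName": "Example Foundation",
                           "funderIdentifier": "https://example.org/funder/1"}],
    "relatedIdentifiers": [{"relatedIdentifier": "10.1234/example-dataset",
                            "relatedIdentifierType": "DOI",
                            "relationType": "HasPart"}],
}
edges = connection_edges("10.1234/example-project", attributes)
```

A record without identifiers yields an empty edge list, so that project would sit in the graph as an isolated node.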
Figure 4 shows the number of occurrences of relationTypes in current projects. In addition to being in the most repositories, HasPart also occurs more than any other relationType.
Over 500 occurrences of HasPart that were used for physical samples were not shown in Figure 4 so others could be seen more clearly.
In addition to relationTypes, RelatedIdentifiers in the DataCite Metadata Schema includes a resourceTypeGeneral element that gives the type of the related resource. This element is useful for understanding the kind of resources that make up the project and potential connections between them. This element is rare in the current metadata, occurring in only three repositories.
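Counting relationTypes, and the resourceTypeGeneral values of related resources where they are present, could be sketched as follows. The function name and the sample records are assumptions; the field names relatedIdentifiers, relationType and resourceTypeGeneral are those of DataCite JSON.

```python
from collections import Counter


def relation_summary(records):
    """Count relationType and related resourceTypeGeneral values across records."""
    relation_types = Counter()
    related_resource_types = Counter()
    for attributes in records:
        for related in attributes.get("relatedIdentifiers", []):
            relation_types[related.get("relationType", "Unknown")] += 1
            # resourceTypeGeneral is optional, so only count it when present.
            general = related.get("resourceTypeGeneral")
            if general:
                related_resource_types[general] += 1
    return relation_types, related_resource_types


# Made-up records with placeholder DOIs.
records = [
    {"relatedIdentifiers": [
        {"relatedIdentifier": "10.1234/a", "relationType": "HasPart",
         "resourceTypeGeneral": "Dataset"},
        {"relatedIdentifier": "10.1234/b", "relationType": "HasPart"},
    ]},
    {"relatedIdentifiers": [
        {"relatedIdentifier": "10.1234/c", "relationType": "References"},
    ]},
    {},  # a record with no relatedIdentifiers at all
]
relation_types, related_resource_types = relation_summary(records)
```

Comparing the two totals gives a quick measure of how often the type of the related resource is actually recorded.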
Many project metadata discussions focus attention on the importance of connections to people, organizations, funders, and other resources. We have shown here that connection capabilities exist in the DataCite schema, but these metadata generally do not take advantage of them. Half of the repositories include identifiers for people and eleven of the forty-two repositories include relatedIdentifiers. Connections cannot be made without these identifiers.
Conclusions
We are in the early days of using DataCite metadata to identify, describe and connect projects and we are exploring how this might be done effectively in the metadata and how the DataCite infrastructure can support project use cases. A small number of repositories are creating project metadata, but they may provide a helpful baseline for future work that will take place as Project becomes a standard resourceTypeGeneral.
The content of these records is similar to that of most DataCite records: the mandatory elements are complete and several other descriptive elements (e.g. Abstract, Keywords) are common. These elements serve the primary DataCite use cases, identification and citation, well, but other content is needed for other expected use cases.
Many project metadata discussions focus attention on the importance of connections to people, organizations, funders, and other resources. Connection capabilities exist in the DataCite schema, but the metadata examined here generally do not take advantage of them. Half of the repositories include identifiers for people and only eleven of the forty-two repositories include relatedIdentifiers.
Describing projects with structured metadata of any kind requires a considerable evolution of how we think about metadata capabilities and use cases. DataCite and Metadata Game Changers are partnering on a project funded by the Richard Lounsbery Foundation titled “Advancing Open Science Infrastructure in Natural Sciences Through Collaborative Innovation Service Development”, which is aimed at exploring how new DataCite resource types, Instrument and Project, might be used across multiple domains. This project will build on community experience and input during two virtual community dialog sessions this fall. All interested parties are welcome. If you are interested in providing ideas and input, please let us know, join our dialogs (soon to be announced), comment below, or send me an email.
Things Have Changed!
One of the things that stood out when I first wrote this blog is that the nascent project records we have been developing for Metadata Game Changers projects were the only ones that included relatedIdentifiers with “IsCitedBy” relationTypes, a small blip near the bottom of Figure 2. We chose IsCitedBy because we hoped that articles and other resourceTypes that were related to the projects would cite them in their references and those citations would define the connections.
The well-known challenges of citing datasets, software, and other resourceTypes clearly indicate that this is a long, aspirational road, and the data show that others have chosen HasPart for defining these relationships, so we updated all of our relationships from IsCitedBy to HasPart. Our thinking is that if a resource is created by the project, typically by the project owner and likely funded by the same award, HasPart is the best choice. If research objects created outside of the project cite it, then IsCitedBy still makes sense. If the community agrees, perhaps this becomes part of the guidance to be developed for projects in DataCite.
Also, the Community Dialog on project metadata in DataCite has now been scheduled for October 15, 22:00 GMT. Registration is now available.
Acknowledgements
This work is funded by a Richard Lounsbery Foundation (http://dx.doi.org/10.13039/100017651) project titled “Advancing Open Science Infrastructure in Natural Sciences Through Collaborative Innovation Service Development”. Thanks to Erin Robinson, Maria Gould, and Kelly Stathis for help with the data and the writing!
Data Availability
The data used in this study are available at https://doi.org/10.5281/zenodo.12774010
References
DataCite and ARDC (2024). DataCite & ARDC Announce Partnership to Deliver the RAiD Service. https://doi.org/10.5438/tk44-dq06.
Habermann, T. (2023). Project Metadata in DataCite. Front Matter. https://doi.org/10.59350/zwzvv-1n627.
Habermann, T. (2024a). FAIR Metadata Concepts in DataCite Metadata Schema. https://zenodo.org/records/12168626.
Habermann, T. (2024b). DataCite Project Metadata Content [Data set]. Zenodo. https://doi.org/10.5281/zenodo.12774010
RAiD (2023). Research Activity Identifier (RAiD) Metadata Schema. https://metadata.raid.org/en/latest/.
Robinson, E., Habermann, T., Buys, M., Chodacki, J., Davies, N., Praetzellis, M., & Wimalaratne, S. (2023). Connecting Place-based Research Back to the Place: Leveraging the Global Research Infrastructure to Enable Open Science. https://agu.confex.com/agu/fm23/meetingapp.cgi/Paper/1459946
Stathis, K., & Robinson, E. (2024). DataCite Connections: A Case Study on FAIR Island Project Metadata. https://doi.org/10.5438/3qxz-sx64.