How Many When (2025 Update): Dataset Bright Spot in the Driver's Seat
Ted Habermann
Cite this blog as Habermann, T. (2025). How Many When (2025 Update): Dataset Bright Spot in the Driver’s Seat. https://doi.org/10.59350/5p6kw-j0740
Understanding how repositories use DataCite, and how that usage changes over time, is important for the DataCite community and provides input for the DataCite Metadata Working Group. My first contribution to this task, during 2022, introduced a tool for exploring DataCite facets and focused on Dataset and Text resources, the two most common resourceTypes at that time. I did an update during 2023 and was very surprised to see samples emerge as a new kind of resource with DataCite DOIs. Samples were covered by the PhysicalObject resourceType, which had existed for several years, but the number of samples in 2023 exceeded 10,000,000, moving PhysicalObject into the top three most common resourceTypes.
I retrieved the resourceType counts including 2024 and was surprised again. The growth of DataCite that started during 2023 accelerated during 2024, with over 21 million resources registered (Figure 1).
Figure 1. Number of DataCite resources and resourceTypes per year.
As in 2023, the increase during 2024 is dominated by one resourceType. This time, the number of Datasets increased from ~2.7 million (a record) during 2023 to almost 14 million during 2024! Figure 2 shows each resourceType as a % of the yearly total, which makes it easier to compare distributions through time than raw numbers do. Datasets made up 72% of the new DOIs during 2024. This is the largest % for Datasets since 2008, and the trend appears set to continue through 2025, with Datasets currently making up 85% of the 2025 DOIs. Note also that Figure 1 suggests that 2025 will be another record year for the number of resources in DataCite.
Figure 2. The % of DOIs/year for the top ten resourceTypes in DataCite. Note the huge influx of samples (grey) during 2023 and the dominance of Datasets (blue) during 2024 and 2025.
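The conversion from raw counts to yearly percentages used for Figure 2 is simple arithmetic. The sketch below is one way it could be done, not necessarily how the figure was produced; the function name and the counts are made up for illustration and are not the real DataCite numbers.

```python
def yearly_shares(counts_by_year):
    """Convert raw counts per resourceType into % of each year's total."""
    shares = {}
    for year, counts in counts_by_year.items():
        total = sum(counts.values())
        shares[year] = {
            rtype: (100.0 * n / total if total else 0.0)
            for rtype, n in counts.items()
        }
    return shares


# Illustrative counts only (NOT the real DataCite numbers).
example_counts = {
    2023: {"Dataset": 30, "PhysicalObject": 50, "Text": 20},
    2024: {"Dataset": 75, "PhysicalObject": 5, "Text": 20},
}

shares = yearly_shares(example_counts)
print(shares[2024]["Dataset"])  # 75.0
```

Working with shares rather than counts keeps a year with an unusually large total from visually swamping the other years.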
Where Did They Come From?
A small number of repositories were responsible for most of the samples registered during 2023. Was it the same during 2024?
We can easily find the top ten repositories that contributed datasets to DataCite by examining the clients facet in the query https://api.datacite.org/dois?registered=2024&resource-type-id=dataset. Each of these repositories contributed over 130,000 datasets, but just one, the National Institute for Fusion Science (NIFS), was responsible for 72% of the new datasets (Figure 3). This repository currently includes over 18 million datasets, and just over 10 million of these were added during 2024.
Figure 3. Distribution of datasets registered by the ten top dataset contributors during 2024.
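For readers who want to repeat the facet lookup, here is a minimal Python sketch. It assumes the API response carries the facet under meta.clients as a list of entries with id, title, and count fields, and the total number of matching DOIs under meta.total; the helper names are hypothetical. The demonstration uses a made-up response so that it runs without network access.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.datacite.org/dois"


def facet_url(year, resource_type_id="dataset"):
    """Build the query used in the text for a given registration year."""
    params = {"registered": year, "resource-type-id": resource_type_id}
    return API + "?" + urllib.parse.urlencode(params)


def fetch(url):
    """Retrieve and decode the JSON response (requires network access)."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp)


def top_clients(response, n=10):
    """Return (name, count, % of total) for the largest clients in the facet.

    Assumes the facet lives under meta.clients and the total under meta.total.
    """
    meta = response.get("meta", {})
    total = meta.get("total", 0)
    ranked = sorted(meta.get("clients", []), key=lambda c: c["count"], reverse=True)
    return [
        (c.get("title", c["id"]), c["count"], 100.0 * c["count"] / total if total else 0.0)
        for c in ranked[:n]
    ]


# Made-up response for illustration (NOT real DataCite data).
mock_response = {
    "meta": {
        "total": 200,
        "clients": [
            {"id": "x.b", "title": "Repo B", "count": 50},
            {"id": "x.a", "title": "Repo A", "count": 120},
        ],
    }
}

ranking = top_clients(mock_response)
print(ranking)  # [('Repo A', 120, 60.0), ('Repo B', 50, 25.0)]
```

Against the live API, the same ranking would come from top_clients(fetch(facet_url(2024))).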
NIFS Metadata Completeness is a Bright Spot
Many of the datasets registered by NIFS appear to be related, like multiple granules collected during experiments. Many dataset titles are shared by hundreds or thousands of datasets, and the landing pages are similar except for local identifiers, perhaps “shot” numbers. These landing pages also include links to extensive domain-specific documentation that goes far beyond the DOI metadata in DataCite.
FAIRness of the dataset metadata in the top ten repositories during 2024 was measured using the techniques described for measuring FAIRness of other DataCite repositories. Figure 4 compares the completeness scores for random samples of 2,000 records from the NIFS metadata (solid line) to the range and median for samples from the nine other big contributors (shaded area). The differences are remarkable. The NIFS metadata clearly stand out, with scores over 80% for the Text and Identifier use cases. They are at the maximum levels in the other two use cases and have a total score of 55%. As a point of reference, this score is comparable to the scores observed for the University of Bath, which was a bright spot among nearly 400 university repositories. The NIFS repository joins the club of amazing DataCite metadata providers!
Figure 4. Completeness of metadata for four FAIRness use cases and total in the ten repositories that each contributed over 130,000 datasets to DataCite during 2024. The NIFS metadata in the Text and Identifier use cases are more complete than those of all the other repositories, and they match the maximum in the other two use cases.
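To make the idea of a completeness score concrete, the sketch below draws a random sample of records and reports the percentage of checked elements that are present. It is only an illustration under stated assumptions: the four checks are a hypothetical stand-in, not the actual rubric behind Figure 4, and the records are assumed to be DataCite JSON attribute dictionaries.

```python
import random

# Hypothetical checks, NOT the rubric used for Figure 4. Each one tests whether
# a DataCite JSON "attributes" dictionary carries a given piece of metadata.
CHECKS = {
    "title": lambda r: any(t.get("title") for t in r.get("titles") or []),
    "description": lambda r: any(d.get("description") for d in r.get("descriptions") or []),
    "creator identifier": lambda r: any(c.get("nameIdentifiers") for c in r.get("creators") or []),
    "funder identifier": lambda r: any(f.get("funderIdentifier") for f in r.get("fundingReferences") or []),
}


def sample_records(records, k=2000, seed=0):
    """Draw a reproducible random sample of up to k records from a list."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def completeness(records, checks=CHECKS):
    """Percentage of (record, check) pairs for which the element is present."""
    if not records:
        return 0.0
    hits = sum(1 for r in records for check in checks.values() if check(r))
    return 100.0 * hits / (len(records) * len(checks))


# Two made-up records: one passes all four checks, one has only a title.
full = {
    "titles": [{"title": "Example granule"}],
    "descriptions": [{"description": "Example description"}],
    "creators": [{"name": "Example", "nameIdentifiers": [{"nameIdentifier": "example-id"}]}],
    "fundingReferences": [{"funderName": "Example funder", "funderIdentifier": "example-funder-id"}],
}
sparse = {"titles": [{"title": "Example granule"}]}

score = completeness([full, sparse])
print(score)  # 62.5
```

Scoring a fixed-size random sample per repository keeps the comparison manageable when one repository holds millions of records.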
Conclusion
Data from all DataCite repositories show that Datasets were by far the most common resourceType registered during 2024 and that the majority of those new datasets were registered by one repository, the National Institute for Fusion Science (NIFS), which registered over 10 million new datasets. It will be interesting to follow up and learn more about how NIFS is creating, documenting, and utilizing DataCite DOIs.
The metadata in this repository are very complete for FAIR use cases built on 1) full-text metadata elements, i.e., names and descriptions, and 2) identifiers for people, organizations, publishers, funders, and awards. The metadata are truly outstanding compared to hundreds of other DataCite repositories. The datasets appear to be data files or granules from similar experimental runs. They provide an outstanding example of a scientific research organization creating complete metadata even at very large scale.