A PID Feast for Research – PIDapalooza 2021

We just finished 24 hours of PIDapalooza last week, with talks, demos, and an amazing amount of information presented by PID experts and users from all over the world. It was a truly international meeting, with presentations and discussions in many languages.

One presentation, “A PID Feast for Research”, was given by Nelli Taller, Dr. Anette Cordts, and Marleen Burger from the PID and Metadata Services department of the German National Library of Science and Technology (TIB). This presentation combined two of my favorite things: PIDs and food. It included some results from an assessment of metadata in 144 repositories from the TIB DataCite Consortium that I did with the authors. The food is universal, but the slides are written in German, so I summarize the results here.

Project

The metadata assessment explored several research questions:

·      Which metadata fields are used frequently?

·      How FAIR are the metadata for a specific DOI?

·      Where is there a need for improvement, information and support?

·      What are the best practice examples?

We explored these questions using data collected during November 2020 for 144 repositories: random samples of up to 300 records from each. Seventy-two of these repositories had fewer than 300 records, so their samples included the entire repository. We examined the metadata to determine the frequency of occurrence of essential and supporting elements for the four FAIR use cases.
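For readers who want to try something similar, here is a minimal sketch of how a sample like this could be collected from the DataCite REST API. This is not exactly the pipeline we used, and the client id shown is a placeholder, not a real consortium member.

```python
import requests

def sample_records(client_id, sample_size=300):
    """Fetch up to sample_size randomly ordered DOI records for one
    DataCite client (repository) from the DataCite REST API."""
    resp = requests.get(
        "https://api.datacite.org/dois",
        params={
            "client-id": client_id,   # placeholder repository id
            "random": "true",         # randomize result order for sampling
            "page[size]": sample_size,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return [doi["attributes"] for doi in resp.json()["data"]]

records = sample_records("tib.example")   # hypothetical client id
print(f"Sampled {len(records)} records")
```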

Findability

Figure 1 shows the completeness (%) of these repositories with respect to essential (red) and supporting (blue) elements for Findability. The points along the right axis are repositories with 300 or more records. The mandatory DataCite fields are included as essential elements here and make up 40% of the essential recommendation. Because these elements are included in essentially all repositories, there is a “floor” for the essential elements at 40%.
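For concreteness, here is a minimal sketch of the idea behind the completeness score: the score for a recommendation is the average occurrence rate of its elements across the sampled records. The element list below is illustrative, not the full Findable Essential recommendation, and the real assessment handles many more elements and field variants.

```python
# Illustrative element list: six mandatory DataCite fields plus a few
# recommended Findability elements (not the complete recommendation).
FINDABLE_ESSENTIAL = [
    "identifier", "creator", "title", "publisher",
    "publicationYear", "resourceTypeGeneral",      # mandatory fields
    "abstract", "keyword", "keywordVocabulary", "projectFunder",
]

def completeness(records, elements):
    """Mean occurrence (%) of the given elements across a record sample."""
    rates = []
    for element in elements:
        present = sum(1 for r in records if r.get(element))
        rates.append(100.0 * present / len(records))
    return sum(rates) / len(rates)

# If the mandatory fields make up 40% of the element list and are present
# in every record, the score cannot fall below 40% -- the floor in Figure 1.
```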

Both of these sets show rather uniform distributions across a range of values, and each set includes several repositories that stand out from the norm. For example, there are a few small repositories (up to ~170 records) with Findable Essential completeness above 70%, and three large repositories above 65%. Identifying these bright spots is an important result of the assessment.

Figure 1. Completeness for metadata elements in the Findable Essential (red) and Support (blue) recommendations as a function of number of records in each repository.

The pattern of metadata element occurrence in one of the large repositories that stands out in both sets is shown in Figure 2. Metadata elements are arranged around the outside of the circle and their occurrence rates are shown along the radials, with 0% at the center and 100% at the outside. The total completeness for each set is shown in the plot title.
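If you want to make this kind of radial chart for your own repository, a minimal matplotlib sketch looks like the following. The element names and occurrence rates here are made up for illustration; only the layout (elements around the circle, 0% at the center, 100% at the edge) mirrors Figure 2.

```python
import numpy as np
import matplotlib.pyplot as plt

def radial_occurrence_plot(rates, title):
    """rates: dict mapping element name -> occurrence rate (0-100)."""
    labels = list(rates)
    values = list(rates.values())
    angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
    values += values[:1]          # repeat the first point to close
    angles += angles[:1]          # the polygon around the circle

    ax = plt.subplot(polar=True)
    ax.plot(angles, values)
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels)    # element names around the circle
    ax.set_ylim(0, 100)           # 0% at the center, 100% at the edge
    ax.set_title(title)
    plt.show()

radial_occurrence_plot(
    {"identifier": 100, "title": 100, "abstract": 100, "keyword": 85},
    "Findable Essential: 65%",    # illustrative values only
)
```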

Figure 2. Completeness for metadata elements in the Findable Essential (left) and Support (right) recommendations in a bright spot repository, a good example for others in the collection. Total completeness is shown in the titles of each plot.

The Findable Essential results (in the left frame) show that the mandatory elements and the abstract element are complete (along the outside of the circle) in this repository and that three other fields (Funder Project Identifier, Keyword, and Project Funder) are included in over 80% of the records in the repository. The Findable Support results show that nine elements are included in the repository and that five of them are present in over 90% of the records.

The overall scores for this repository (65% and 61%) along the right axis in Figure 1 show that this repository is a bright spot for both element sets and, therefore, it is a great example for other repositories that are working on improving their metadata for Findability.

Figure 3 shows results similar to Figure 1 for essential and support elements for the Accessibility, Interoperability, and Reusability use cases. The DataCite metadata schema is generally focused on F, so there are fewer elements in the A, I, and R use cases, and they are combined into one set. These use cases include just one mandatory field (Resource URL in the Essential set), so there is a floor of ~6% on that set. These sets reflect more focused use cases, e.g., specific kinds of resource relationships, and the results show many opportunities for improving metadata completeness.

Figure 3. Completeness for metadata elements in the AIR Essential (red) and Support (blue) recommendations as a function of number of records in each repository.

There are several small and large repositories with completeness above 30% in both of these sets. The repository with the most complete metadata in the AIR Essential set (41%) has nearly 100 records. The assessment data for this repository is shown in Figure 4. Six of the seven elements included in this set are complete for all records in the repository, indicating great consistency across the collection. Three of these elements, DocumentedBy, HasMetadata, and Methods, are very rare in DataCite metadata. They provide important links to more complete metadata for these datasets and to descriptions of the processing done to create the products.
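In DataCite terms, these three elements correspond to the IsDocumentedBy and HasMetadata relationTypes on related identifiers and to the Methods descriptionType. Here is a small sketch, assuming the JSON layout returned by the DataCite REST API, that flags them in a single record:

```python
def has_rare_air_elements(record):
    """Return flags for three rare AIR elements in one DataCite JSON record."""
    relations = {
        ri.get("relationType")
        for ri in record.get("relatedIdentifiers") or []
    }
    description_types = {
        d.get("descriptionType")
        for d in record.get("descriptions") or []
    }
    return {
        "DocumentedBy": "IsDocumentedBy" in relations,
        "HasMetadata": "HasMetadata" in relations,
        "Methods": "Methods" in description_types,
    }
```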

Figure 4. Completeness of metadata elements in the AIR Essential Recommendation for a bright spot repository. Note the consistency across this repository, i.e., six of the seven elements are complete across all records.

The bright spot in the AIR Support group, the blue point near the left side of Figure 3 with a completeness of 40%, has just one record (a great start). It includes complete metadata about the identifier for the dataset contact and the URI of the rights associated with the dataset (Figure 5).

Figure 5. Completeness of metadata elements in the AIR Support Recommendation for a bright spot repository with a single record. Note the complete metadata for the dataset contact identifier and the rights URI.

All collections of repositories, such as those in DataCite consortia like TIB, include bright spots like those shown above that can serve as examples for other repositories in the group. An overall metadata assessment across the collection helps identify these bright spots, which are important in developing guidance for the other repositories in the collection.

At the same time, the assessment provides a picture of the entire collection that can be used to identify broad patterns and opportunities for short-term wins supporting a metadata improvement effort. Figure 6 shows the average completeness for all 144 repositories in this study.

Figure 6. Average completeness across all repositories for metadata elements in all four recommendations.

The data clearly show the dominance of mandatory fields in these metadata: six in the upper left and one in the lower left. Many repositories contain only these elements, as they are all that is required for the most common DataCite use case: getting a DOI for some object. They also reflect the general focus of the DataCite schema on discovery, identification and citation. Other fields common in the Findable Essential set (abstract, keyword, and keyword vocabulary) are important for supporting text searches for DataCite resources. Improving the overall FAIRness of DataCite metadata depends on many recommended and optional elements, and these are much less common across all repositories.

A second important role of DataCite is connecting resources across the research ecosystem. As noted in many other PIDapalooza talks, identifiers (ORCIDs, RORs, funder IDs, etc.) are critical for making unambiguous connections. Identifier metadata elements other than resource identifiers are generally included in the support sets and, in general, they remain rare in these repositories and in all DataCite repositories. Finding these identifiers and adding them to the metadata is a critical ongoing task required to construct the PID Graph and increase its utility across the entire network.
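A quick census of these connection-making PIDs is straightforward. Here is a sketch that counts ORCIDs, RORs, funder identifiers, and related DOIs in DataCite JSON records; it assumes affiliations were expanded as objects (requested with the affiliation=true query parameter), and exact field locations can vary by record.

```python
from collections import Counter

def count_pids(records):
    """Tally connection-making identifiers across a sample of records."""
    counts = Counter()
    for r in records:
        for creator in r.get("creators") or []:
            for ni in creator.get("nameIdentifiers") or []:
                if "orcid" in (ni.get("nameIdentifierScheme") or "").lower():
                    counts["ORCID"] += 1
            for aff in creator.get("affiliation") or []:
                if "ror" in (aff.get("affiliationIdentifierScheme") or "").lower():
                    counts["ROR"] += 1
        for fr in r.get("fundingReferences") or []:
            if fr.get("funderIdentifier"):
                counts["Funder ID"] += 1
        for ri in r.get("relatedIdentifiers") or []:
            if ri.get("relatedIdentifierType") == "DOI":
                counts["Related DOI"] += 1
    return counts
```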

Conclusions

The metadata assessment provides insights into the metadata of TIB consortium members and identifies good examples to guide future improvements. We concluded that:

·      There are already some good examples of comprehensive and consistent metadata.

·      The overall average of the repositories examined reflects the focus on findability and mandatory fields.

·      Not every metadata field assigned to the FAIR categories is equally useful for every resource type and every subject area (e.g., geographical extent is of interest for archaeological sites).

·      There is still a lot of potential for optimizing the metadata, particularly with regard to accessibility, interoperability and reusability.

·      DOIs in Related Identifiers and PIDs such as ORCID and ROR are still rarely used. More frequent use would have a positive effect on discoverability, interoperability and reusability.

Overall, we are still looking for the recipe for metadata success!

If you are interested in finding good examples in your DataCite metadata, please contact us at Metadata Game Changers.