Metadata Archeology: Hunting Affiliations and RORs in DataCite Metadata
Cite this blog as: Metadata Archeology: Hunting Affiliations and RORs in DataCite Metadata. Front Matter. https://doi.org/10.59350/jmewf-dsf80
In my last blog I introduced Metadata Archeology with a description of digging around in Crossref metadata for affiliations associated with authors of work published in the Dryad Data Repository. I showed how the Crossref Participation Reports can be used for an initial survey of Crossref metadata, much as satellite images are used in the GlobalXplorer project to find ancient sites in Peru.
Many people and organizations are already involved in affiliation archeology. Experiences from the American Physical Society, NASA, Dryad, and many others all indicate that it is a tricky business, or maybe a swamp. DataCite has rich potential for affiliation data. How can we survey that resource and begin to understand that potential?
DataCite Affiliations
The DataCite metadata model includes rich information about over 18 million items with DOIs. The schema includes at least four elements that describe organizations: publisher, creator, contributor, and fundingReference. Each of these elements has interesting characteristics (a sketch of how they appear in a record follows this list):
The publisher element is required and, therefore, occurs in the vast majority of DataCite records. This guarantees lots of data.
The creator element is also required, but the affiliation sub-property is optional. Thus, creators exist in the vast majority of records, but how many of those have affiliations?
The contributor element is used to give credit to people or organizations that contribute to a research result in many different ways. In this case both the element and its affiliation sub-property are optional. How does that affect the provision of affiliations?
The fundingReference element is optional, and it can include a funder identifier with a type. It has been in the DataCite schema since late 2016.
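To make these elements concrete, here is a minimal sketch of how they can appear in the JSON that the DataCite REST API returns for a record. The field names follow the API's attributes block as I understand it; the values are invented for illustration:

```python
# Illustrative fragment of a DataCite record (invented values) showing the
# four organization-bearing elements discussed above.
record = {
    # publisher: required, so present in essentially every record
    "publisher": "Example Data Center",
    # creators: required, but the affiliation sub-property is optional
    "creators": [
        {"name": "Example, Alice", "affiliation": ["Example University"]}
    ],
    # contributors: optional, and affiliation is optional here too
    "contributors": [
        {"name": "Example, Bob", "contributorType": "DataCurator",
         "affiliation": ["Example Institute"]}
    ],
    # fundingReferences: optional, may carry a typed funder identifier
    "fundingReferences": [
        {"funderName": "National Science Foundation",
         "funderIdentifier": "https://doi.org/10.13039/100000001",
         "funderIdentifierType": "Crossref Funder ID"}
    ],
}
```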
Do these differences affect the nature / quality of the affiliation data in the DataCite records? This is an initial look at those questions.
The Data
The DataCite repository currently has metadata for ~18,000,000 registered items. That collection is huge and heterogeneous. Way too big for an initial survey! Can we break it into subsets that make sense?
DataCite has ~140 Providers that hold anywhere from a single record to almost 2 million. Quite a variation in size! These Providers each manage DOIs for some number of Data Centers. Nearly half of the DataCite Providers include only one Data Center. I expect these Data Center collections to be (at least somewhat) homogeneous, so they are the initial set of collections for this survey. Altogether there are 1432 Data Centers in the DataCite repository. Some are big and some are small.
I used the very useful DataCite sampling capability, which returns a random sample from any DataCite query, to select 100 items from each Data Center. This seems like a very small sample size, but the median Data Center collection size is 113, so, for 687 of the 1411 collections with findable content, a sample of 100 is the complete collection. In other cases, it is a small sample. Hopefully it is representative enough for an initial survey.
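As a sketch of how such a sample can be pulled, the query below filters on a single Data Center and asks the DataCite REST API for a randomly ordered page of results. The random and client-id parameters are my reading of the API documentation at the time of writing, so check them against the current docs:

```python
import requests

API = "https://api.datacite.org/dois"

def sample_data_center(client_id: str, n: int = 100) -> list:
    """Fetch up to n randomly sampled DOI records for one Data Center."""
    params = {"client-id": client_id, "random": "true", "page[size]": n}
    resp = requests.get(API, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]

# For example, using the Data Center discussed below:
# records = sample_data_center("zbmed.ifado")
```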
Altogether, my sample included 94,032 records from 1411 Data Center collections.
Publishers
The vast majority of the collections have one publisher per record, so I ended up with just over 93,000 publisher elements and just over 5,000 unique publishers. The distribution of the number of publishers per Data Center is shown in Figure 1 (note that the bin size along the y-axis is non-linear). The most common case (689 Data Centers) is one publisher per Data Center, with almost 1300 Data Centers having ten or fewer publishers.
Large numbers of Data Centers with small numbers of publishers is very exciting. It means that the process of transitioning to persistent identifiers for publishers could be straightforward for many DataCite Data Centers. Almost 700 out of ~1400 Data Centers (49%) only need to know one publisher identifier. In fact, the picture is even better, because many of the Data Centers actually have multiple representations of the same publisher in their current metadata. In many cases these multiple representations boil down to simple misspellings, acronyms, or differences in the addresses of organizations included in the affiliation strings. For example, the four publishers from the zbmed.ifado Data Center (shown below) are obviously all the same organization (https://ror.org/05cj29x94) with small variations in the name:
IfADo - Leibniz Research Centre for Working Environment and Human Factor
IfADo - Leibniz Research Centre for Working Environment and Human Factors
IfADo - Leibniz Research Centre for Working Environment and Human Factors, Dortmund
Leibniz Research Centre for Working Environment and Human Factors, Dortmund
The long-term goal of this work is assigning identifiers to organizations in the scholarly communications community. These data suggest that identifying these inconsistencies and cleaning them up in the process of assigning identifiers will immediately improve connectivity for over 90% of the DataCite Data Centers. In the case shown above, a ROR already exists for Leibniz Research Centre for Working Environment and Human Factors and that is the only ROR that zbmed.ifado needs to know to assign RORs to their 477 records.
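A first pass at collapsing such near-duplicates does not need anything fancy. Here is a minimal sketch using Python's standard-library difflib to group publisher strings that differ only slightly; the 0.85 cutoff is an illustrative choice, not a tested threshold:

```python
from difflib import SequenceMatcher

def cluster_names(names, threshold=0.85):
    """Greedily group strings whose similarity to a cluster's first member
    meets the threshold."""
    clusters = []
    for name in names:
        for cluster in clusters:
            ratio = SequenceMatcher(None, name.lower(), cluster[0].lower()).ratio()
            if ratio >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

publishers = [
    "IfADo - Leibniz Research Centre for Working Environment and Human Factor",
    "IfADo - Leibniz Research Centre for Working Environment and Human Factors",
    "IfADo - Leibniz Research Centre for Working Environment and Human Factors, Dortmund",
    "Leibniz Research Centre for Working Environment and Human Factors, Dortmund",
]
print(cluster_names(publishers))  # the four variants collapse into one cluster
```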
It is well known that extracting meaningful organization names from affiliation information is a challenge. That remains true here. In a first pass I was able to identify RORs for roughly 25% of the publishers in this sample. This number will increase with a closer look at these strings and increased adoption of RORs across the scholarly communication community.
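For the lookup itself, the ROR API offers an affiliation-matching endpoint that returns scored candidate organizations for a raw affiliation string. A minimal sketch, with error handling omitted and the response shape taken from the v1 API as I understand it:

```python
import requests

def match_ror(affiliation: str):
    """Return the ROR ID that the matching service flags as 'chosen', if any."""
    resp = requests.get(
        "https://api.ror.org/organizations",
        params={"affiliation": affiliation},
        timeout=30,
    )
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        if item.get("chosen"):  # ROR's own high-confidence flag
            return item["organization"]["id"]
    return None

print(match_ror("Leibniz Research Centre for Working Environment and Human Factors"))
# expected: https://ror.org/05cj29x94
```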
Creators and Contributors
As expected, the situation for creator and contributor affiliations is different because this information is optional rather than required. Fewer than half of the Data Centers (591 / 1432) provide any affiliation information for creators or contributors. Figure 2 shows the number of creator/contributor affiliations per Data Center (note the non-linear y-axis). Again, the most commonly occurring number of affiliations per Data Center is one, for 108/591 (~18%) of the Data Centers. Once again, this helps make it possible to jump-start the adoption process with these Data Centers. As in the publisher case, there is also inconsistency in some of the affiliations, which will decrease the number of RORs required as it is cleaned up.
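For the curious, the per-Data-Center counts behind Figure 2 can be reproduced along these lines. This sketch tolerates both string and object forms of the affiliation field, since the exact shape varies across records; the normalization is an assumption on my part:

```python
def unique_affiliations(records):
    """Collect distinct creator/contributor affiliation strings
    from sampled DataCite records."""
    found = set()
    for rec in records:
        attrs = rec.get("attributes", {})
        people = (attrs.get("creators") or []) + (attrs.get("contributors") or [])
        for person in people:
            for aff in person.get("affiliation") or []:
                # Some records carry plain strings, others objects with a name.
                found.add(aff if isinstance(aff, str) else aff.get("name", ""))
    found.discard("")
    return found

# e.g. len(unique_affiliations(sample_data_center("zbmed.ifado")))
```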
Funders
Acknowledging funding for research projects and results in a specific section of research articles has long been a standard part of scholarly communications. Adding funder names to the metadata makes it possible to index and search that information, and identifiers improve the consistency of the identification and search results. In addition, they improve machine readability of the data.
The number of Data Centers that provide funder information remains small (178/1432) three years after this content was introduced. Figure 3 shows the distribution of the number of Funders per Data Center (note the non-linear y-axis). As in the other cases, many of the Data Centers that have funder names have a small number of Funders: eighty-five Data Centers currently have five or fewer Funders. Also, as in the previous cases, there are multiple representations of many of the funders that, once cleaned up, will help increase the adoption of the current Crossref identifiers, or others.
As mentioned above, the fundingReference section of the DataCite metadata includes an element for the funder identifier, and 46 of the Data Centers provide this information. The majority of the funder ids come from the Crossref Funder Registry, with most of the remainder from GRID or ISNI.
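Tallying those identifier types across a sample is a short exercise once the records are in hand; a sketch, again assuming the REST API's fundingReferences attribute names, with invented example output:

```python
from collections import Counter

def funder_id_types(records):
    """Count funderIdentifierType values across sampled DataCite records."""
    return Counter(
        ref.get("funderIdentifierType", "none")
        for rec in records
        for ref in rec.get("attributes", {}).get("fundingReferences") or []
    )

# Illustrative output only (not real survey numbers):
# Counter({'Crossref Funder ID': 41, 'GRID': 3, 'ISNI': 2, 'none': 7})
```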
Can You Improve Your Affiliations?
Realizing the benefits of consistent organization names and persistent identifiers will take some time and effort across the whole community. The first-pass ROR identification process does slightly better with affiliations than with publishers, identifying RORs for 27% of the affiliation strings. Some patterns emerge that might help researchers and providers improve their affiliation names:
1. Standard Names – each organization has a standard name that is used in the ROR database along with the identifier. Look up your organization at https://ror.org and use that standard name, along with the identifier of course (a lookup sketch follows this list).
2. Acronyms – many of the affiliations in the DataCite metadata are just acronyms. Unexpanded acronyms are a well-known problem in the scientific literature. Using them alone to identify organizations can easily cause ambiguity and other problems. Probably best to avoid acronyms and abbreviations altogether in affiliations.
3. Addresses – all addresses are made up of many parts that can be combined or omitted in many ways. The DataCite metadata schema does not include an element that holds physical addresses so providers add parts of them onto affiliation strings. This exacerbates the recognition challenge. Probably best to avoid address information in DataCite affiliations unless it is critical.
4. University Names – University names, like addresses, can have many parts. Is it “University of X”, “X University”, or “The University of X”? Of course, the word university can be abbreviated in many ways (U, U., Univ.) and is different in different languages. Then there are state universities with campuses in many locations. Is the separator a space, a comma, a dash, or the word ‘at’? All of these are used, and sometimes they differ within a single state system. Probably best to fall back on suggestion #1: see how ROR already does it and stick with that.
5. Funder Names – Stick with the Crossref Funder Registry for names of funding organizations. If this registry migrates to another identifier system (perhaps ROR), this will allow systematic migration from Crossref to the next identifier registry.
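As a concrete illustration of suggestion #1, the standard name for a known ROR ID can be fetched directly from the ROR API. A minimal sketch against the v1 endpoint, using the IfADo identifier from earlier as the example:

```python
import requests

def standard_name(ror_id: str) -> str:
    """Fetch the canonical organization name for a ROR identifier."""
    resp = requests.get(f"https://api.ror.org/organizations/{ror_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()["name"]

print(standard_name("05cj29x94"))
# prints the registry's standard name for https://ror.org/05cj29x94
```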
All of these suggestions are, at best, stopgap solutions. The ambiguity of these names is a principal motivation for identifiers in the first place. Fortunately, DataCite is including organizational identifiers for creators and contributors in the next release of its metadata schema (Version 4.3), which is imminent. Keep your eyes open for that announcement and integrate organizational identifiers (and others) into your DataCite metadata and tools as soon as possible.
Conclusion
The initial affiliation survey of DataCite metadata has uncovered some interesting results. First, the Data Center level seems reasonable for analysis, and the DataCite API sampling capability is very useful for generating Data Center collections for analysis. Second, the numbers of unique publisher, creator/contributor, and funder affiliations per Data Center are generally small and, in many cases, just one. This greatly simplifies ROR adoption for these Data Centers, as they only need a few RORs to cover all of their metadata records. Improving the consistency of the affiliation strings will also help this process along.
Removing ambiguity and ensuring credit where it is due are important goals of the ROR community and we can all contribute to achieving success. Now is the time to ROR!