Improving Domain Repository Connectivity: Person or Organization?

Cite this blog as Habermann, T. (2021). Improving Domain Repository Connectivity: Person or Organization? Front Matter. https://doi.org/10.59350/yajj6-y6687

Most current metadata standards recognize that people and organizations play similar roles in the creation and management of datasets and other research objects. This dichotomy has been managed by introducing the concept of a ‘party’, which in the ISO TC211 metadata standards for geographic data, for example, can be a person, an organization, or a position. Each type of party has different properties: for example, organizations can include people or positions, but people and positions cannot include organizations.

In the DataCite metadata schema, the dichotomy is managed by soft-typing the creator and contributor objects, i.e. including the nameType property that is ‘Personal’ for names of people and ‘Organizational’ for names of organizations. If the nameType property is not provided, the default value of ‘Personal’ is used.

When humans read metadata, this default value is not a problem, as humans can tell the difference between “Metadata Game Changers” and “Ted Habermann” regardless of the value of nameType. When machines read the metadata, the default can cause problems, one being the choice of identifier type appropriate for the party. If the party is a person, ORCID is the first place to search; if it is an organization, ROR or GRID is more appropriate.
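
A minimal sketch of how a machine reader might choose where to search, assuming a creator is represented as a Python dict shaped like the DataCite JSON. The scheme names come from the text above; the function name and the routing table are illustrative assumptions, not part of any DataCite tool.

```python
# Sketch only: SEARCH_ORDER and schemes_to_search are hypothetical names.
SEARCH_ORDER = {
    "Personal": ["ORCID"],
    "Organizational": ["ROR", "GRID"],
}


def schemes_to_search(creator):
    """Return the identifier schemes to try for one creator dict."""
    # A missing (or empty) nameType falls back to 'Personal', the default
    # behaviour described in the text above.
    name_type = creator.get("nameType") or "Personal"
    return SEARCH_ORDER[name_type]


print(schemes_to_search({"name": "Ted Habermann", "nameType": "Personal"}))
print(schemes_to_search({"name": "Metadata Game Changers", "nameType": "Organizational"}))
print(schemes_to_search({"name": "UNAVCO Community"}))
```

In this sketch the last creator has no nameType, so it is treated as a person and routed to ORCID even though a human would recognize it as an organization.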

The UNAVCO Community

This series of blog posts explores the hypothesis that domain repositories are great places to work on improving connectivity because they build strong communities of people and organizations that contribute data to the repository, and use data from it, many times. In the UNAVCO case, the importance of the community is reflected in the observation that ‘UNAVCO Community’ and ‘Community, UNAVCO’ are by far the most common creator names in the UNAVCO DataCite metadata, occurring 1471 times. In other words, they occur in over 27% of the DOIs, outnumbering the other major contributors by over 1000 occurrences.

The UNAVCO Community is clearly an important contributor to the repository and their contributions need to be reflected in the connections that make up the PID Graph. It seems reasonable to identify the community using the ROR for UNAVCO itself, as the community is an inseparable part of the organization. Of course, this has a major effect on the connectivity of the repository. Figure 1 shows the progression of connectivity through the various stages of this work with green being complete connectivity (identifiers for all creators), yellow being partial connectivity (some identifiers), and red being missing (no identifiers).

The top bar shows the situation after identifiers were added for the UNAVCO community. Adding these identifiers resulted in a five-fold increase in the number of DOIs with identifiers for all creators (green) and a decrease of 27% in the number of DOIs with no identifiers (red). Note that the number of DOIs with partial connectivity did not change very much, indicating that the UNAVCO Community was the only creator on most of the datasets where it is listed as a creator. When we added the identifier, the connectivity for those datasets went from missing to complete.  

Figure 1. Connectivity for parties in UNAVCO DataCite metadata through time (increasing upward). Note the large improvement resulting from adding the identifier for the UNAVCO Community.
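
The three connectivity categories can be sketched as a small classifier. This is an illustration under stated assumptions, not the code used to produce Figure 1: it assumes each DOI record is a dict with a "creators" list, counts only nameIdentifiers as identifiers, and treats a record with no creators as missing. The function names and the sample records are hypothetical.

```python
from collections import Counter


def connectivity(creators):
    """Classify one DOI's creators as 'complete', 'partial', or 'missing'."""
    if not creators:
        # Assumption: a record with no creators counts as missing.
        return "missing"
    identified = sum(1 for c in creators if c.get("nameIdentifiers"))
    if identified == len(creators):
        return "complete"  # green: identifiers for all creators
    return "partial" if identified else "missing"  # yellow or red


def tally(records):
    """Count how many DOI records fall into each connectivity category."""
    return Counter(connectivity(r.get("creators", [])) for r in records)


ror = {"nameIdentifier": "https://ror.org/02n9tn974", "nameIdentifierScheme": "ROR"}
# Hypothetical sample records, for illustration only.
records = [
    {"creators": [{"name": "UNAVCO Community"}]},
    {"creators": [{"name": "UNAVCO Community", "nameIdentifiers": [ror]}]},
    {"creators": [{"name": "UNAVCO Community", "nameIdentifiers": [ror]},
                  {"name": "Example, Person"}]},
]
print(tally(records))
```

A dataset whose only creator is the UNAVCO Community moves straight from missing to complete once the identifier is added, which matches the behaviour described for the top bar.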

Defensive Metadata

This blog post began by discussing how people and organizations are differentiated in DataCite metadata using the nameType property and pointed out that ‘Personal’ is the default value for this property. This can result in organization names being misidentified as personal if metadata creation processes do not differentiate between people and organizations. In fact, this is the case in the UNAVCO metadata, i.e. UNAVCO Community is written without a nameType and, therefore, identified as a personal name:

"creators": [
    {
        "name": "UNAVCO Community",
        "affiliation": []
    }
]

As mentioned above, this can cause problems in searches for identifiers.

There are at least two ways to avoid these problems. First, the code that reads the metadata can search multiple identifier services, i.e. ORCID and ROR, for each name and record the type of the identifiers found. Second, the metadata creator can provide identifiers in two ways: as a nameIdentifier and as an affiliationIdentifier. With this approach, users will find the identifier whichever way they search.
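
The first option might look something like the sketch below. The lookup functions are stand-ins that return canned answers so the example runs without network access; a real reader would replace them with calls to the ORCID and ROR search services. All function names here are hypothetical.

```python
def find_identifiers(name, services):
    """Try every lookup in `services` and record the scheme of each hit."""
    found = []
    for scheme, lookup in services.items():
        identifier = lookup(name)
        if identifier is not None:
            found.append({
                "nameIdentifier": identifier,
                "nameIdentifierScheme": scheme,
            })
    return found


# Stand-in lookups (assumptions): they mimic a search service with a
# tiny in-memory table instead of making a network request.
def lookup_orcid(name):
    return None  # no person named "UNAVCO Community" in this stub


def lookup_ror(name):
    known = {"UNAVCO Community": "https://ror.org/02n9tn974"}
    return known.get(name)


services = {"ORCID": lookup_orcid, "ROR": lookup_ror}
hits = find_identifiers("UNAVCO Community", services)
print(hits)
```

Because the scheme is recorded alongside each hit, the reader learns that the name belongs to an organization even though the metadata carried no nameType.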

With the second approach, the metadata looks like:

{
    "name": "UNAVCO Community",
    "affiliation": [
        {
            "affiliationIdentifier": "https://ror.org/02n9tn974",
            "affiliationIdentifierScheme": "ROR",
            "affiliation": "UNAVCO Community"
        }
    ],
    "contributorType": "creator",
    "nameType": "Organizational",
    "nameIdentifiers": [
        {
            "nameIdentifier": "https://ror.org/02n9tn974",
            "nameIdentifierScheme": "ROR"
        }
    ]
}
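
A sketch of how a metadata creator might generate such a record, and how two different readers would each find the identifier. The helper names are hypothetical, and the record is simplified to the fields that carry the identifier.

```python
def defensive_creator(name, identifier, scheme):
    """Build an organizational creator with the identifier in both places."""
    return {
        "name": name,
        "nameType": "Organizational",
        "nameIdentifiers": [
            {"nameIdentifier": identifier, "nameIdentifierScheme": scheme}
        ],
        "affiliation": [
            {
                "affiliation": name,
                "affiliationIdentifier": identifier,
                "affiliationIdentifierScheme": scheme,
            }
        ],
    }


def identifiers_by_name(creator):
    """What a reader that only inspects nameIdentifiers would find."""
    return [i["nameIdentifier"] for i in creator.get("nameIdentifiers", [])]


def identifiers_by_affiliation(creator):
    """What a reader that only inspects affiliations would find."""
    return [
        a["affiliationIdentifier"]
        for a in creator.get("affiliation", [])
        if "affiliationIdentifier" in a
    ]


creator = defensive_creator("UNAVCO Community", "https://ror.org/02n9tn974", "ROR")
print(identifiers_by_name(creator))
print(identifiers_by_affiliation(creator))
```

Both readers return the same ROR, which is the point of storing it twice.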

I think of this approach as defensive metadata: accepting redundant information in the metadata to make sure users and tools find the information they are looking for regardless of where they look. The redundancy seems like a small price to pay for making life easier for a variety of users.

Conclusion

At this point in the project, all possible information has been extracted from the existing UNAVCO DataCite metadata. We increased the portion of datasets with identifiers for all people from 2% to 31% and the portion of datasets with identifiers for organizations from 0% to 25%. Our confidence in most of the assertions we made about connections to identifiers is very high. The next step involves searching the UNAVCO community publications for identifiers, affiliations, and RORs. Stay tuned!