Improving Domain Repository Connectivity: Article Metadata Archeology

Cite this blog as Habermann, T. (2021). Improving Domain Repository Connectivity: Article Metadata Archeology. Front Matter. https://doi.org/10.59350/s5cfh-dx817

In previous blogs we used DataCite metadata for UNAVCO to demonstrate how identifiers could be found and spread through the metadata collection to improve connectivity for people and organizations. The community built around UNAVCO was a critical part of this process because community members, both individuals and organizations, make many contributions over time. We also benefited from the fact that all contributing organizations have RORs and that the UNAVCO Community itself was recognized as an important contributor to the datasets described in the metadata.

Like many domain repositories, UNAVCO keeps track of papers that are published using data that are in the repository. Dataset DOIs and clear citation guidelines both make it easier to do this tracking. The list of UNAVCO community publications available on the website includes 1569 articles published between 2003 and 2018. This is a rich source of identifiers (ORCIDs) for community members and affiliations (leading to RORs) that were not included in the original DataCite metadata. Finding these identifiers in the papers is termed Metadata Archeology as it involves searching for exciting finds in large information sets.

Finding Article DOIs

The first step in the process of finding identifiers within these papers is to find DOIs for the papers themselves. This was done by searching Google for the titles of the papers and checking the results for pages with titles matching the titles of the papers. If such a match exists, the metadata of the page can be scraped for a meta tag with the name “citation_doi” whose content is the DOI for the paper.

An example of this approach is illustrated in Figure 1 for the paper titled “A revised dislocation model of interseismic deformation of the Cascadia subduction zone” (https://doi.org/10.1029/2001JB001227). In this case, as in most examples, all goes well and the DOI is easily determined from the first link in the Google results. When using this approach on over 1500 papers, there are inevitable hiccups and challenges, but I was able to retrieve DOIs for 1222 (78%) of the papers.
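The scraping step can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for this work: it assumes the page HTML has already been downloaded, the names CitationDoiParser and find_citation_doi are hypothetical, and the sample page is a hand-written stand-in built from the title and DOI of the example above.

```python
from html.parser import HTMLParser


class CitationDoiParser(HTMLParser):
    """Record the content of the first <meta name="citation_doi"> tag seen."""

    def __init__(self):
        super().__init__()
        self.doi = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags are of interest, and only until a DOI is found.
        if tag != "meta" or self.doi is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "citation_doi":
            self.doi = (attributes.get("content") or "").strip() or None


def find_citation_doi(html):
    """Return the DOI from a citation_doi meta tag, or None if there is none."""
    parser = CitationDoiParser()
    parser.feed(html)
    return parser.doi


# Hand-written stand-in for a journal landing page (not a real download).
sample_page = """
<html><head>
<title>A revised dislocation model of interseismic deformation of the Cascadia subduction zone</title>
<meta name="citation_doi" content="10.1029/2001JB001227">
</head><body></body></html>
"""

print(find_citation_doi(sample_page))
```

Pages without the tag return None, which is one way the hiccups mentioned above can show up when processing a long list of titles.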

Figure 1. Searching Google for titles and scraping a DOI from the page metadata (<meta> tags).


Once the DOIs were known, I used two approaches to find ORCIDs and affiliations:

1.    search Crossref metadata

2.    search and scrape journal web pages.

The first approach is preferred because the Crossref metadata are in a standard, structured representation and retrieving ORCIDs and affiliations is straightforward. These standard metadata are an invaluable resource for aspiring metadata archeologists. In contrast, scraping journal web pages is remarkably inconsistent. Affiliations (without identifiers) are often available in meta tags (citation_author and citation_author_institution). Unfortunately, no citation_author_identifier or citation_author_institution_identifier tags exist. ORCIDs, if available, are often hidden in mouseovers, popups, or other exotic and clever approaches that may work for humans but are difficult for machines. In any case, the vast majority of ORCIDs and affiliations identified were from Crossref.
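To illustrate why the structured Crossref route is straightforward, here is a minimal Python sketch, not the code used for this work. It assumes the record comes from the public Crossref REST API (https://api.crossref.org/works/{doi}), where the "message" object holds an "author" list whose entries may carry "given", "family", "ORCID" (as a URL), and "affiliation" (a list of objects with a "name"). The function names are hypothetical, and the sample record is invented for illustration; it uses ORCID's documented example iD, not a real UNAVCO author.

```python
import json
import urllib.parse
import urllib.request


def fetch_crossref_message(doi):
    """Fetch the Crossref record for a DOI (needs network access; not called below)."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url) as response:
        return json.load(response)["message"]


def extract_author_identifiers(message):
    """Return name, bare ORCID iD, and affiliation names for each author."""
    authors = []
    for author in message.get("author", []):
        name_parts = (author.get("given"), author.get("family"))
        name = " ".join(part for part in name_parts if part)
        orcid_url = author.get("ORCID")
        # Crossref stores the ORCID as a URL; keep only the trailing iD.
        orcid = orcid_url.rstrip("/").rsplit("/", 1)[-1] if orcid_url else None
        affiliations = [
            entry["name"] for entry in author.get("affiliation", []) if entry.get("name")
        ]
        authors.append({"name": name, "orcid": orcid, "affiliations": affiliations})
    return authors


# Invented record in the shape described above (illustration only).
sample_message = {
    "author": [
        {
            "given": "Example",
            "family": "Author",
            "ORCID": "http://orcid.org/0000-0002-1825-0097",
            "affiliation": [{"name": "Example University"}],
        },
        {"given": "Second", "family": "Author", "affiliation": []},
    ]
}

for record in extract_author_identifiers(sample_message):
    print(record)
```

Note that the affiliations arrive as plain names, which is why RORs still have to be found in a later step.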

Connectivity

Now we have a collection of DOIs with identifiers for people and affiliations (no RORs yet), which means we can determine the connectivity of the collection. Figure 2 shows the baseline connectivity of this collection for affiliations and people using the same visualization used earlier for the DataCite metadata. A horizontal bar represents the entire collection and colored sections indicate connectivity: green for items that have complete connectivity (on the left), yellow for items that have partial connectivity (in the middle), and red for items that have no connectivity (on the right). The pictures are very different for affiliations and ORCIDs. Over 70% of the papers have affiliations for all authors (green in Figure 2) while only 2% of the papers have ORCIDs for all authors. More importantly, over 90% of the papers have no ORCIDs. The average connectivity for ORCIDs is 4.2% while the average for affiliations is 71%. This reflects the observation that it is typical for all authors of a paper to have affiliations while only a few, typically the corresponding author, have ORCIDs.
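The categories and averages above can be computed with a short Python sketch. It rests on an assumption about the definition: the connectivity of a paper is taken to be the fraction of its authors that have the identifier, and the average connectivity is the mean of that fraction over all papers. The function name and the four sample papers are hypothetical and are not drawn from the UNAVCO collection.

```python
def summarize_connectivity(papers, key):
    """Count complete/partial/none papers and average the per-paper fraction.

    papers is a list of author lists; each author is a dict, and an author
    counts as connected when the value stored under key is non-empty.
    """
    counts = {"complete": 0, "partial": 0, "none": 0}
    fractions = []
    for authors in papers:
        total = len(authors)
        connected = sum(1 for author in authors if author.get(key))
        if total == 0 or connected == 0:
            counts["none"] += 1
        elif connected == total:
            counts["complete"] += 1
        else:
            counts["partial"] += 1
        fractions.append(connected / total if total else 0.0)
    average = sum(fractions) / len(fractions) if fractions else 0.0
    return {**counts, "average": average}


# Four invented papers (illustration only).
papers = [
    [{"orcid": "0000-0002-1825-0097", "affiliations": ["A"]}, {"orcid": None, "affiliations": ["B"]}],
    [{"orcid": None, "affiliations": ["A"]}, {"orcid": None, "affiliations": ["C"]}],
    [{"orcid": None, "affiliations": ["A"]}, {"orcid": None, "affiliations": []}],
    [{"orcid": None, "affiliations": []}],
]

print("affiliations:", summarize_connectivity(papers, "affiliations"))
print("ORCIDs:", summarize_connectivity(papers, "orcid"))
```

In this toy collection the affiliations are complete for two papers, partial for one, and missing for one, while the ORCIDs are partial for one paper and missing for three, the same kind of disparity seen in the real data.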

Figure 2. Baseline connectivity of this collection for affiliations and people using the same visualization used earlier for the DataCite metadata collection.


Figure 3 shows the average connectivity for ORCIDs and Affiliations over time. It confirms the general disparity between identifiers for people and affiliations. It also indicates that the connectivity for affiliations has increased over the last several years.

Figure 3. Connectivity in the article metadata for affiliations (orange) and ORCIDs (blue).


It is important to note that these connectivity data are for papers rather than datasets, so the completeness of the identifiers is in the hands of the journals that publish the papers and provide the metadata for the papers to Crossref rather than the UNAVCO repository.

Conclusion

UNAVCO, like many domain repositories, tracks papers that use data from the repository. These papers are a potential source for author identifiers, affiliations, and organization identifiers. The search for these identifiers involves multiple steps: first finding DOIs for the papers using title searches, and then retrieving identifiers from Crossref or scraping journal web pages.

The data indicate that this collection of papers is a much richer source for affiliation information than for ORCIDs. Over 70% of the papers have affiliations for all authors while only 2% of the papers have ORCIDs for all authors, similar to the numbers observed for ORCIDs in the DataCite metadata.

The next step is to associate ORCIDs and affiliations with authors and to transfer discovered identifiers back to the dataset metadata in DataCite that are managed, and therefore controlled, by UNAVCO. Even though there are only a few ORCIDs in the journal metadata, the new ORCIDs identified here can apply to many datasets in the DataCite metadata (the target of the connectivity improvement process). The large number of affiliations gives us a good chance of adding new affiliations and organizational identifiers to the dataset metadata in DataCite.