DataCite Subject Metadata
Cite this blog as Habermann, T. (2020). DataCite Subject Metadata. Front Matter. https://doi.org/10.59350/5e0yr-xsh50
Martin Fenner from DataCite recently described the benefits of some standardization of sources for vocabularies used for three DataCite metadata elements: language, rights, and subjects. All of these elements:
can play important roles in the dataset discovery and selection processes at DataCite,
are implemented using shared vocabularies that are identified by associated Scheme elements, e.g. subjectScheme, and
are optional.
Metrics are an important part of any improvement process as they provide a mechanism for identifying successes and quantitative measurements of progress that can help motivate improvements across a community. Establishing a starting point, i.e. a baseline, is a critical first step in the process. In the case of metadata improvement, a baseline can be created empirically by examining the current state of the elements that are the target of the improvements.
I start this process here with the DataCite subject element because it is the most common of the three and is important both for finding datasets through text searches and for classifying datasets. The input data were 100 random records from each of 546 DataCite clients (selected using the DataCite DOI API with the random parameter = true). For clients that have fewer than 100 records, this selects all of their metadata. The total sample size is 22,500 records including 26,188 subjects. This is a small sample, but perhaps useful as a starting point.
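As a rough illustration of the sampling step, the sketch below builds the kind of request URL that could be used to pull up to 100 random records for one client. The endpoint and the parameter names (client-id, random, page[size]) reflect my reading of the DataCite REST API and should be checked against its documentation; the client id in the usage note is hypothetical.

```python
from urllib.parse import urlencode

# Assumed public endpoint of the DataCite REST API.
API_ROOT = "https://api.datacite.org/dois"


def sample_url(client_id: str, size: int = 100) -> str:
    """Build a request URL for up to `size` random DOIs from one client."""
    params = {
        "client-id": client_id,  # repository (client) to sample from
        "random": "true",        # ask the API for a random selection
        "page[size]": size,      # clients with fewer records return them all
    }
    return f"{API_ROOT}?{urlencode(params)}"
```

Calling sample_url("example.client") returns a URL string that can be handed to any HTTP client; repeating the call for each client id gives the full sample.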
The DataCite subject element is a container that includes four elements: subject, subjectScheme, schemeURI, and valueURI. I am focused here on the last three as the subject elements themselves are an interesting but different can of worms. What is the current state of these elements?
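To make the container concrete, here is a minimal sketch of a subject as a small Python data class. The class name, the field names, and the populated() helper are my own illustrative choices, not part of the DataCite schema; only the four element names come from the schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Subject:
    """One DataCite subject: the term plus three optional elements."""
    subject: str
    subject_scheme: Optional[str] = None
    scheme_uri: Optional[str] = None
    value_uri: Optional[str] = None

    def populated(self) -> List[str]:
        """Names of the optional elements that carry a non-empty value."""
        optional = {
            "subjectScheme": self.subject_scheme,
            "schemeURI": self.scheme_uri,
            "valueURI": self.value_uri,
        }
        return [name for name, value in optional.items() if value and value.strip()]
```

For example, Subject("quark", value_uri="https://www.wikidata.org/wiki/Q259642").populated() reports that only valueURI is filled in.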
subjectScheme
The subjectScheme element is a free-text name for the subject vocabulary. The sample included 104 values for subjectScheme that occurred 4719 times in 113/546 collections (~21%). Values that occurred more than one hundred times are listed in Table 1. The first observation is that many of these names are acronyms, which can limit their utility to users unfamiliar with a particular domain. For example, as an Earth scientist, I “know” that GCMD is the NASA Global Change Master Directory, but I have no idea what ddc, LCSH, or MeSH are. In contrast to the nine schemes in Table 1 that occur more than one hundred times, sixty-one occur ten or fewer times.
This is a diverse set of vocabularies and the standardization that Martin proposes would clearly be helpful. The vocabulary he suggested occurs as OECD 26 times and as OECD FOS 2007 76 times.
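Counts like these can be produced with a simple tally. The sketch below is one way to do it, assuming the sample has been loaded as (client id, list of subject dicts) pairs; that layout and the function name are assumptions for illustration, not a description of the tooling actually used.

```python
from collections import Counter


def tally(records, element="subjectScheme"):
    """Count values of `element` and the clients whose records use it.

    `records` is an iterable of (client_id, subjects) pairs, where
    `subjects` is a list of dicts keyed by element name (assumed layout).
    Returns (value counts, clients using the element, clients seen).
    """
    counts = Counter()
    clients_all = set()
    clients_with = set()
    for client_id, subjects in records:
        clients_all.add(client_id)
        for subj in subjects:
            # Treat missing, None, and whitespace-only values as absent.
            value = (subj.get(element) or "").strip()
            if value:
                counts[value] += 1
                clients_with.add(client_id)
    return counts, len(clients_with), len(clients_all)
```

The same function works for the other two elements by changing the element argument, and counts.most_common() gives the ordering used in the tables.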
Table 1. subjectSchemes that occur 100 or more times in the sample metadata.

| subjectScheme | count |
| --- | --- |
| GCMD | 999 |
| ddc | 572 |
| LCSH | 442 |
| keyword | 405 |
| MeSH | 268 |
| classification | 149 |
| dewey | 138 |
| LCCN | 129 |
| NARCIS-classification | 119 |
schemeURI
The DataCite schemeURI is defined as “The URI of the subject identifier scheme”. As such, it should be helpful and perhaps more distinctive than the name of the scheme. The sample includes 332 values for schemeURI that occur 2356 times in 71/546 collections (13%). Only four of these occur more than 100 times (Table 2). The first two resolved as is, the third did not resolve, and the fourth resolved once its spelling was corrected (http instead of htp).
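A misspelled protocol such as htp can be caught without any network access. The following is a minimal sketch of such a syntactic check using Python's standard URL parser; the function name and the choice to accept only http and https are my assumptions.

```python
from urllib.parse import urlparse


def uri_problem(uri: str):
    """Return a short description of a syntactic problem, or None."""
    parsed = urlparse(uri.strip())
    # Only web protocols are expected for scheme and value URIs here.
    if parsed.scheme not in ("http", "https"):
        return f"unexpected scheme {parsed.scheme!r}"
    if not parsed.netloc:
        return "missing host"
    return None
```

With this check, htp://id.loc.gov is flagged for its scheme while http://www.nlm.nih.gov/mesh/ passes. It says nothing about whether a well-formed URI actually resolves.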
Table 2. Subject schemeURIs that occur 100 or more times in the sample metadata.

| schemeURI | count |
| --- | --- |
| http://www.nlm.nih.gov/mesh/ | 268 |
| http://www.narcis.nl/classification | 119 |
| http://dewey.info/ (does not resolve) | 115 |
| htp://id.loc.gov (misspelled) | 102 |
The schemeURIs are also very diverse, with 272/332 occurring five or fewer times. Inspection suggests that this is because many of these values are valueURIs rather than schemeURIs, that is, they include what appear to be identifiers for items in vocabularies, e.g. http://purl.obolibrary.org/obo/MOD_00693.
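One rough way to flag schemeURIs that are probably valueURIs is to look for digits in the last part of the URI, where item identifiers usually sit. This is only a heuristic of my own, sketched under that assumption; it will misclassify vocabularies whose base URI ends in a version number and documents with numeric file names.

```python
import re
from urllib.parse import urlparse


def looks_like_value_uri(uri: str) -> bool:
    """Guess whether a URI points at a single term rather than a vocabulary."""
    parsed = urlparse(uri.strip())
    segments = [s for s in parsed.path.split("/") if s]
    # Prefer the fragment (e.g. #term123), else the last path segment.
    tail = parsed.fragment or (segments[-1] if segments else "")
    return bool(re.search(r"\d", tail))
```

Applied to the examples in this post, http://purl.obolibrary.org/obo/MOD_00693 is flagged as a likely valueURI, while http://www.nlm.nih.gov/mesh/ and http://www.narcis.nl/classification are not.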
valueURI
The DataCite metadata schema defines valueURI as “The URI of the subject term”, that is, it identifies a term in a vocabulary rather than the entire vocabulary. ValueURIs are the links in linked data. These are the least common subject elements in this sample, with 82 values occurring 462 times in 30/546 collections (5%). As described above, these URIs can be recognized because they include something that looks like an item id, e.g. the Q259642 in https://www.wikidata.org/wiki/Q259642 is an identifier for the word quark. The most common valueURIs are listed in Table 3 along with the terms they identify (shown in parentheses). Note that the most common URI did not resolve and the last in the list is a document rather than a term definition.
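Whether a URI resolves can be tested automatically. The sketch below uses Python's standard library and treats any successful response as "resolves"; the function name, the timeout, and the injectable opener argument (there so the check can be exercised without a network) are my own choices. A URI that resolves may still point at a document rather than a term definition, so this does not replace inspection.

```python
import urllib.error
import urllib.request


def resolves(uri: str, opener=urllib.request.urlopen, timeout: float = 10.0) -> bool:
    """Return True if a request for `uri` succeeds."""
    try:
        with opener(uri, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # HTTP errors, unknown protocols, malformed URIs, and network failures.
        return False
```

Running resolves() over the distinct URIs rather than every occurrence keeps the number of requests small.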
Table 3. Common valueURIs from the sample metadata.

| valueURI | count |
| --- | --- |
| http://www.narcis.nl/classfication/D37000 (not found) | 65 |
| https://id.loc.gov/authorities/subjects/sh85043932 (English Poetry) | 58 |
| https://id.loc.gov/authorities/subjects/sh00006266 (Irish Authors) | 58 |
| http://d-nb.info/gnd/4320709-1 (branch?) | 30 |
| http://vocab.getty.edu/aat/300375632 (Tetraclinis articulata (species)) | 30 |
| http://vocab.getty.edu/aat/300053170 (computer graphics) | 30 |
| http://www.oecd.org/science/inno/38235147.pdf (document) | 29 |
Conclusion
A preliminary look at subject metadata from the DataCite repository suggests that, in addition to standardizing the vocabularies used for subjects, many other aspects of these metadata could be improved. First, nearly 80% of the sampled DataCite collections do not include subject scheme names and about 87% of the collections do not include subject scheme URIs. The subject scheme names include many acronyms that limit the audience that can understand them, and common schemeURIs and valueURIs do not resolve or are misspelled. Equally important, many schemeURIs are actually valueURIs, i.e. the correct content is in the incorrect element.
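The percentages follow directly from the collection counts reported in the sections above. A minimal sketch of the arithmetic, with variable and function names of my own choosing:

```python
TOTAL_CLIENTS = 546

# Collections in the sample that use each element (from the sections above).
clients_with = {"subjectScheme": 113, "schemeURI": 71, "valueURI": 30}


def percent_missing(present: int, total: int = TOTAL_CLIENTS) -> float:
    """Percentage of collections that do not use an element."""
    return round(100 * (total - present) / total, 1)


missing = {name: percent_missing(n) for name, n in clients_with.items()}
# missing == {"subjectScheme": 79.3, "schemeURI": 87.0, "valueURI": 94.5}
```

Recomputing these three numbers on a later sample gives a simple completeness baseline to compare against.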
These observations suggest several checks or metrics that might be helpful for creating a baseline for measuring progress in improving these metadata. Beyond simple measures of completeness, checks for acronyms in names and unresolved links could be implemented retrospectively or on metadata ingest or creation. Identifying and fixing these problems requires a concerted combination of automated checking and ongoing collaboration between DataCite and its community of metadata providers.
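As one example of such a check, the sketch below flags scheme names that look like acronyms. The rule (a word with two or more capital letters, or a name that is a single very short word) and the length threshold are assumptions I chose to fit the names in Table 1; they would need tuning against a larger sample.

```python
import re


def has_acronym(name: str, max_short: int = 4) -> bool:
    """Heuristically decide whether a scheme name is or contains an acronym."""
    tokens = re.findall(r"[A-Za-z]+", name)
    # Any word with two or more capitals, e.g. GCMD, MeSH, NARCIS.
    if any(sum(ch.isupper() for ch in token) >= 2 for token in tokens):
        return True
    # A name that is one very short word, e.g. ddc.
    return len(tokens) == 1 and len(tokens[0]) <= max_short
```

On the Table 1 names this flags GCMD, ddc, LCSH, MeSH, LCCN, and NARCIS-classification, and passes keyword, classification, and dewey, so it catches acronyms but not every opaque name.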