Metadata Game Changers
Blog

Exploring metadata, communities, and new ideas.

July 14, 2020

DataCite Subject Metadata

July 14, 2020/ Ted Habermann

Cite this blog as Habermann, T. (2020). DataCite Subject Metadata. Front Matter. https://doi.org/10.59350/5e0yr-xsh50

Martin Fenner from DataCite recently described the benefits of some standardization of sources for vocabularies used for three DataCite metadata elements: language, rights, and subjects. All of these elements:

  • can play important roles in the dataset discovery and selection processes at DataCite

  • are implemented using shared vocabularies that are identified by associated Scheme elements, e.g. subjectScheme, and

  • are optional.

Metrics are an important part of any improvement process as they provide a mechanism for identifying successes and quantitative measurements of progress that can help motivate improvements across a community. Establishing a starting point, i.e. a baseline, is a critical first step in the process. In the case of metadata improvement, a baseline can be created empirically by examining the current state of the elements that are the target of the improvements.

I start this process here with the DataCite subject element because it is the most common of the three and is important both for finding datasets through text searches and for classifying them. The input data were 100 random records from each of 546 DataCite clients (selected using the DataCite DOI API with the random parameter set to true). For clients that have fewer than 100 records, this selects all of their metadata. The total sample is 22,500 records that include 26,188 subjects. This is a small sample, but perhaps useful as a starting point.
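The sampling step could be sketched roughly as follows. This is an illustration, not the code used for this post: it assumes the public DataCite REST endpoint at https://api.datacite.org/dois with the client-id, random, and page[size] query parameters, and it assumes that each record in the JSON response carries its subjects under attributes.subjects. The function names sample_url and extract_subjects are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public DataCite REST API.
API_ROOT = "https://api.datacite.org/dois"


def sample_url(client_id: str, size: int = 100) -> str:
    """Build a request URL for a random sample of one client's DOI records."""
    params = {"client-id": client_id, "random": "true", "page[size]": size}
    return f"{API_ROOT}?{urlencode(params)}"


def extract_subjects(response: dict) -> list:
    """Collect the subject containers from a parsed JSON response.

    Assumes the layout data[].attributes.subjects; records without
    subjects contribute nothing.
    """
    subjects = []
    for record in response.get("data", []):
        subjects.extend(record.get("attributes", {}).get("subjects", []))
    return subjects
```

Fetching each URL (for example with urllib.request) and passing the parsed JSON to extract_subjects would give the list of subject containers for one client. The request itself is left out so the sketch can be read without network access.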

The DataCite subject element is a container that includes four elements: subject, subjectScheme, schemeURI, and valueURI. I am focused here on the last three as the subject elements themselves are an interesting but different can of worms. What is the current state of these elements?

SubjectScheme

The subjectScheme element is a free-text name for the subject vocabulary. The sample included 104 values for subjectScheme that occurred 4719 times in 113/546 collections (~21%). Values that occurred more than one hundred times are listed in Table 1. The first observation is that many of these names are acronyms, which can limit their utility to users unfamiliar with a particular domain. For example, as an Earth scientist, I “know” that GCMD is the NASA Global Change Master Directory, but I have no idea what ddc, LCSH, or MeSH are. In contrast to the schemes in Table 1 that occur more than one hundred times, sixty-one occur ten or fewer times.

This is a diverse set of vocabularies and the standardization that Martin proposes would clearly be helpful. The vocabulary he suggested occurs as OECD 26 times and as OECD FOS 2007 76 times.

Table 1. SubjectSchemes that occur 100 or more times in the sample metadata

subjectScheme            count
GCMD                       999
ddc                        572
LCSH                       442
keyword                    405
MeSH                       268
classification             149
dewey                      138
LCCN                       129
NARCIS-classification      119
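
Tallies like those in Table 1 can be produced with a simple counter. The sketch below is illustrative rather than the code used for this post. It assumes each client's sample has already been reduced to a (client id, list of subject dictionaries) pair, and the function names tally and coverage are hypothetical.

```python
from collections import Counter


def tally(records, key="subjectScheme"):
    """Count occurrences of each value of `key` and note which clients use it.

    `records` is an iterable of (client_id, subjects) pairs, where
    `subjects` is a list of dictionaries. Missing or empty values are skipped.
    """
    counts = Counter()
    clients = set()
    for client_id, subjects in records:
        for subject in subjects:
            value = subject.get(key)
            if value:
                counts[value] += 1
                clients.add(client_id)
    return counts, clients


def coverage(clients_with_value, total_clients):
    """Percentage of clients that supply at least one value."""
    return 100 * len(clients_with_value) / total_clients
```

Calling counts.most_common() on the result gives the ranking used for the table, and coverage gives the share of collections, e.g. 113 of 546 is about 21%. The same function applies to the other elements by changing key.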

SchemeURI

The DataCite schemeURI is defined as “The URI of the subject identifier scheme”. As such, it should be helpful and perhaps less ambiguous than the name of the scheme. The sample includes 332 values for schemeURI that occur 2356 times in 71/546 collections (13%). Only four of these occur more than 100 times (Table 2). The first two resolved as is, the third did not resolve, and the fourth resolved once its spelling was corrected (http instead of htp).

Table 2. Subject schemeURIs that occur 100 or more times in the sample metadata.

schemeURI                                  count
http://www.nlm.nih.gov/mesh/                 268
http://www.narcis.nl/classification          119
http://dewey.info/ (does not resolve)        115
htp://id.loc.gov (misspelled)                102

The schemeURIs are also very diverse, with 272 of the 332 occurring five or fewer times. Inspection suggests that this is because many of these schemeURIs are actually valueURIs, that is, they include what appear to be identifiers for items in vocabularies, e.g. http://purl.obolibrary.org/obo/MOD_00693.
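
One way to flag schemeURIs that look like valueURIs is to test whether the last path segment resembles an item identifier. The sketch below is a rough heuristic, not the inspection method used here: it assumes that a run of three or more digits in the final segment signals an item id, and the name looks_like_value_uri is hypothetical. A scheme URI that legitimately ends in a year or version number would be a false positive.

```python
import re
from urllib.parse import urlsplit

# Assumption: term identifiers contain a run of three or more digits.
ITEM_ID = re.compile(r"\d{3,}")


def looks_like_value_uri(uri: str) -> bool:
    """Return True when the last path segment of `uri` resembles an item id."""
    path = urlsplit(uri.strip()).path.rstrip("/")
    last_segment = path.rsplit("/", 1)[-1]
    return bool(ITEM_ID.search(last_segment))
```

Applied to the examples in this post, the heuristic flags http://purl.obolibrary.org/obo/MOD_00693 and leaves http://www.nlm.nih.gov/mesh/ alone.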

valueURI

The DataCite metadata schema defines valueURI as “The URI of the subject term”, that is, it identifies a term in a vocabulary rather than the entire vocabulary. ValueURIs are the links in linked data. These are the least common subject elements in this sample, with 82 values occurring 462 times in 30/546 collections (5%). As described above, these URIs can be recognized because they include something that looks like an item id, e.g. the Q259642 in https://www.wikidata.org/wiki/Q259642 is an identifier for the term quark. The most common valueURIs are listed in Table 3 along with the terms they identify (shown in parentheses). Note that the most common URI did not resolve and the last in the list is a document rather than a term definition.

Table 3. Common valueURIs from the sample metadata

valueURI                                                                  count
http://www.narcis.nl/classfication/D37000 (not found)                        65
https://id.loc.gov/authorities/subjects/sh85043932 (English Poetry)          58
https://id.loc.gov/authorities/subjects/sh00006266 (Irish Authors)           58
http://d-nb.info/gnd/4320709-1 (branch?)                                     30
http://vocab.getty.edu/aat/300375632 (Tetraclinis articulata (species))      30
http://vocab.getty.edu/aat/300053170 (computer graphics)                     30
http://www.oecd.org/science/inno/38235147.pdf (document)                     29

Conclusion

A preliminary look at subject metadata from the DataCite repository suggests that, in addition to standardizing the vocabularies used for subjects, many other aspects of these metadata could be improved. First, nearly 80% of the DataCite collections do not include subject information and over 87% of the collections do not include meaningful subject scheme URIs. The subject scheme names include many acronyms that limit the audience that can understand them, and some of the most common schemeURIs and valueURIs do not resolve or are misspelled. Equally important, many schemeURIs are actually valueURIs, i.e. the correct content is in the incorrect element.

These observations suggest several checks or metrics that might be helpful for creating a baseline for measuring progress in improving these metadata. Beyond simple measures of completeness, checks for acronyms in names and unresolved links could be implemented retrospectively or on metadata ingest or creation. Identifying and fixing these problems requires a concerted combination of automated checking and ongoing collaboration between DataCite and its community of metadata providers.
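
Two of these checks could be sketched as below. Both are heuristics under stated assumptions, not an established DataCite validation: a scheme name is treated as a likely acronym when it is a single alphabetic token that is at most four letters long or contains two or more capitals, and a URI is flagged when its scheme is not http or https or when it has no host. The function names are hypothetical, and testing whether a link actually resolves would need a separate network request that is not shown.

```python
from urllib.parse import urlsplit


def looks_like_acronym(name: str) -> bool:
    """Flag scheme names such as 'GCMD', 'ddc', or 'MeSH' as likely acronyms."""
    token = name.strip()
    if not token.isalpha():
        return False  # multi-word or hyphenated names are not treated as acronyms
    capitals = sum(ch.isupper() for ch in token)
    return len(token) <= 4 or capitals >= 2


def uri_problems(uri: str) -> list:
    """List structural problems that can be found without a network request."""
    parts = urlsplit(uri.strip())
    problems = []
    if parts.scheme not in ("http", "https"):
        problems.append(f"unexpected scheme '{parts.scheme}'")
    if not parts.netloc:
        problems.append("missing host")
    return problems
```

Run on ingest, checks like these would catch the htp misspelling in Table 2 before it reached the repository, while names such as keyword or NARCIS-classification pass the acronym test.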

 
