Metadata Evolution - Metadata Completeness, Agility, and Collection Size

I recently introduced a simple metric for measuring metadata collection completeness with respect to elements in the CrossRef Participation Reports. The suggestion of this metric immediately led to speculation about relationships between collection size and completeness. Small collections include fewer records – are they more likely to be complete? Publishers with large collections have more resources – do they have more complete metadata? Are smaller publishers more agile – can they change more?

Talking and Thinking About Metadata

The idea that the language we use to talk about things shapes the way we think, or can think, about those things has been around since the 1800s and even has a name, the Sapir–Whorf hypothesis, coined in 1954. It was Whorf who said, “Language is not simply a reporting device for experience but a defining framework for it.” Last year Lera Boroditsky discussed a similar idea from the stage at TEDWomen with some nice examples and data from multiple languages and cultures. I have been thinking and writing about a universal documentation language for some time and bring together a couple of those ideas here.

Some metadata terms emerged from my metadata evaluation and guidance work with many partners. I described the concept of “metadata dialects” and suggested that many metadata standards are more like dialects of a universal documentation language than they are like separate languages. Some have questioned whether a universal “documentation language” really exists. I admit that it is really a concept that I hope exists rather than a real language described in an unabridged dictionary somewhere.

More recently, I introduced this dialect nomenclature to the Metadata 2020 community of metadata experts who advocate richer, connected, reusable, and open metadata for all research outputs. The terms are slowly creeping into some Metadata 2020 discussions, hopefully helping to build and cross bridges between different communities that are committed to better metadata in all contexts.

Documentation or Metadata?

Many datasets and products are documented using approaches and tools developed by data collectors to support their analysis and understanding. This documentation exists in notebooks, scientific papers, web pages, user guides, word processing documents, spreadsheets, data dictionaries, PDFs, databases, custom binary and ASCII formats, and almost any other conceivable form, each with associated storage and preservation strategies. This custom, often unstructured, approach may work well for independent investigators or in the confines of a particular laboratory or community, but it makes it difficult for users outside of these small groups to discover, use, and understand the data without consulting with its creators.

Metadata are standard and structured documentation.

Metadata, in contrast to documentation, helps address discovery, use, and understanding by providing well-defined, structured content. This makes it possible for users to access and quickly understand many aspects of datasets that they have not collected. It also makes it possible to integrate information into discovery and analysis tools, and to provide consistent references from the metadata to external documentation.

Metadata standards provide standard element names and associated structures that can describe a wide variety of digital resources. The definitions and domain values are intended to be sufficiently generic to satisfy the metadata needs of various disciplines. These standards also include references to external documentation and well-defined mechanisms for adding structured information to address specific community needs.

Another important difference between documentation and metadata is the target audience. Documentation is targeted at humans and it relies heavily on our capability to make sense of a variety of unstructured information. Metadata, on the other hand, is typically targeted at applications. Many of these applications facilitate searching metadata and displaying it in a way that facilitates data discovery by humans. As tools mature and, more importantly, the breadth of existing metadata increases, we will see more and more applications creating and using metadata to facilitate more sophisticated metadata- and data-driven discovery, comparisons between multiple datasets, and other analyses.

Of course, the audience is also very important when we create metadata. Humans like descriptions that help them understand the resources being described and citations to more, likely unstructured, information. Applications are generally much more demanding when it comes to consistency and completeness. It is important to consider both audiences when creating and improving metadata.
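To make the point about applications concrete, here is a minimal Python sketch of the kind of check an application might run on a structured record. The element names (title, creator, abstract, license) and both function names are hypothetical, chosen only for illustration; they are not taken from any particular metadata standard, and this is not the completeness metric mentioned earlier.

```python
# Hypothetical required elements, for illustration only; a real
# application would take these from the metadata standard it supports.
REQUIRED_ELEMENTS = ("title", "creator", "abstract", "license")


def missing_elements(record, required=REQUIRED_ELEMENTS):
    """Return the required elements that are absent or empty in a record."""
    return [
        name
        for name in required
        if not str(record.get(name) or "").strip()
    ]


def completeness(record, required=REQUIRED_ELEMENTS):
    """Fraction of required elements that have non-empty values (0.0 to 1.0)."""
    return 1 - len(missing_elements(record, required)) / len(required)


record = {
    "title": "Sea surface temperature, 2010-2015",
    "creator": "Example Laboratory",
    "abstract": "   ",  # whitespace only: easy for a human to skim past
    # "license" is not present at all
}

print(missing_elements(record))  # ['abstract', 'license']
print(completeness(record))      # 0.5
```

A human reader might still get something out of this record from its title and creator alone, while an application that needs an abstract or a license has nothing to work with.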

Note added: It is interesting to see that the word “documentation” has a much longer history than the word “metadata”. Metadata is really the new kid on the block.