Metadata Evolution - Metadata Completeness, Agility, and Collection Size

I recently introduced a simple metric for measuring metadata collection completeness with respect to elements in the CrossRef Participation Reports. The suggestion of this metric immediately led to speculation about relationships between collection size and completeness. Small collections include fewer records – are they more likely to be complete? Publishers with large collections have more resources – do they have more complete metadata? Are smaller publishers more agile - can they change more? Lots of interesting questions.
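For readers who want to experiment with this kind of measurement, here is a minimal sketch of how such a completeness score can be computed. It assumes the metric is the sum, over the elements tracked in the Participation Reports, of the fraction of records in a collection that contain each element; the element names and record format below are hypothetical stand-ins, not the real report schema.

```python
from typing import Iterable

# Hypothetical subset of the metadata elements tracked in the
# CrossRef Participation Reports; the real reports track more.
ELEMENTS = ["abstract", "orcid", "references", "funder", "license"]

def completeness(records: Iterable[dict]) -> float:
    """Sum, over tracked elements, of the fraction of records that
    contain the element. Ranges from 0 to len(ELEMENTS)."""
    records = list(records)
    if not records:
        return 0.0
    return sum(
        sum(1 for record in records if record.get(element)) / len(records)
        for element in ELEMENTS
    )

# Every record carries the same two elements, so the score is exactly 2.0.
sample = [{"abstract": True, "orcid": True} for _ in range(100)]
print(completeness(sample))  # 2.0
```

Under this definition, a collection in which every record carries the same elements scores an exact integer, which is relevant to the clustering discussed below.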

The original work was based on a sample of 1684 CrossRef members. I increased the size of the sample to 4292 for this work. This is still less than half of the CrossRef membership, but hopefully it is a large enough sample to identify general patterns.

Figure 1 shows completeness as a function of size (number of items) for current collections of journal articles. The largest collection includes over 1.6 million items, while the median collection size is 128. I trimmed the X-axis at 4000 to preserve some detail in the bulk of the collections. The fifty-four collections with more than 4000 items are shown on the right edge of the plot.
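The trimming is simple to reproduce: collections larger than the cutoff are pinned to the right edge of the plot. A matplotlib sketch, using random stand-in data in place of the real sizes and scores:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)                      # stand-in data only
size = rng.lognormal(mean=5, sigma=2, size=4292).astype(int) + 1
score = rng.uniform(0, 10.1, size=4292)

CUTOFF = 4000
large = size > CUTOFF
clipped = np.minimum(size, CUTOFF)   # pin large collections to the edge

plt.scatter(clipped[~large], score[~large], s=5, label="size <= 4000")
plt.scatter(clipped[large], score[large], s=15, c="red",
            label="size > 4000 (clipped)")
plt.xlim(0, CUTOFF)
plt.xlabel("Collection size (number of items)")
plt.ylabel("Completeness")
plt.legend()
plt.show()
```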

Figure 1. Collection completeness vs. size for journal articles from the current time period. Large collections (size > 4000) are plotted on the right edge for clarity.

The completeness of the large collections varies from 0 to 9.08, which is close to the range of completeness (0 to 10.1) across all sizes, and there are 65 smaller collections (size < 4000) that are more complete than the most complete large collection. There is no clear completeness trend as a function of collection size.

Several features of the completeness data shown in Figure 1 are worth noting and can be clarified by examining the distribution of completeness directly (Figure 2). First, a large number of collections include very few (completeness < 1) of the metadata elements included in the CrossRef Participation Reports. This probably reflects the fact that the included elements go beyond the minimum metadata required by CrossRef, together with the common tendency of metadata providers to settle for that minimum.

Figure 2. Distribution of collection completeness. Note the large number of collections without any of the metadata elements included in the CrossRef Participation Reports (completeness = 0), the paucity of collections with completeness between ~6 and ~8, and the concentration of collections at integer values of completeness.
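A distribution like Figure 2 is a fine-binned histogram of the per-collection scores; bins much narrower than one unit (0.1 here) are what make the integer spikes and the gap between ~6 and ~8 visible. A sketch, again with stand-in scores:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
score = rng.uniform(0, 10.1, size=4292)   # stand-in completeness scores

bins = np.arange(0, 10.2, 0.1)            # 0.1-wide bins resolve the
plt.hist(score, bins=bins)                # integer spikes and the gap
plt.xlabel("Completeness")
plt.ylabel("Number of collections")
plt.show()
```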

Second, there is a clear paucity of collections with completeness between six and eight, suggesting that there may be two sets of elements that tend to occur together. One set has fewer than six elements and the other has four. Collections that include only the first set have completeness <= 6, while those that include both sets have completeness > 8. Future work will focus on determining whether these groups actually exist and which elements they include.

Finally, there is a clear tendency for completeness to cluster around integer values, i.e., the bins that include integer values are all higher than the adjacent bins that do not. This reflects the occurrence of “homogeneous” collections (Habermann, 2017), in which all existing metadata elements are included in all records. These collections have completeness metrics that equal the number of elements they include. Future work will identify and discuss these interesting collections.
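Because a homogeneous collection scores exactly the number of elements it includes, candidate homogeneous collections can be flagged by testing how close each score sits to an integer. A sketch (the tolerance is an arbitrary choice, and a near-integer score suggests, but does not prove, homogeneity):

```python
import numpy as np

def near_integer(scores: np.ndarray, tol: float = 0.05) -> np.ndarray:
    """Boolean mask of scores within tol of an integer -- candidate
    homogeneous collections (Habermann, 2017)."""
    return np.abs(scores - np.round(scores)) < tol

scores = np.array([3.0, 5.97, 6.4, 0.0, 9.02])
print(near_integer(scores))   # [ True  True False  True  True]
```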

The change in completeness between the backfile and the current time periods is the one available measurement that may serve as a proxy for “agility”. Figure 3 shows this change as a function of collection size on the same scale as Figure 1.

Figure 3. Current completeness minus backfile completeness as a function of collection size (blue), with a 54-point running average (red). The mean change is 0.8 and the majority of collections are improving. The running average suggests a small increase in change with size.

The red line in this figure shows a running average of the change in completeness, with a window of fifty-four points, the number of large collections. These data reflect the diversity of the smaller collections and show a slight increase in change with collection size. This might suggest that the larger collections are actually more “agile”, or that increased resources slightly favor more improvement. However, the trend is not very strong.
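A running average like the red line can be computed by sorting the collections by size and taking a rolling mean of the change over a 54-point window. A pandas sketch with stand-in data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                  # stand-in data only
df = pd.DataFrame({
    "size": rng.integers(1, 5000, size=4292),
    "change": rng.normal(0.8, 1.0, size=4292),  # current - backfile
})

df = df.sort_values("size")
df["running_avg"] = df["change"].rolling(window=54, center=True).mean()
print(df[["size", "running_avg"]].dropna().head())
```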

Conclusions

I started with some questions that emerged shortly after I proposed a quantitative measure of metadata completeness with respect to the metadata elements measured by the CrossRef Participation Reports. Examining completeness and change as a function of collection size sheds some light on answers to these questions. First, the data do not show a clear relationship between completeness and collection size (Figure 1), but there is a very weak trend of increasing change with collection size (Figure 3). Figures 1 and 2 also show several other features: 1) a large number of collections are very incomplete with respect to the metadata elements included in the reports, 2) a paucity of collections with completeness between ~6 and ~8, and 3) a clear tendency for completeness to cluster around integer values. These last two features will be explored further in future blogs.