Metadata Evolution - CrossRef Participation Reports
Over the last several years CrossRef has grown into one of the largest and most significant metadata repositories in the world. It contains over 100,000,000 registered content items and offers many services to help members and other users take advantage of that content. In addition to managing registered content, CrossRef makes links between resources. It is the connective tissue for the web of scholarly communications.
Of course, CrossRef knows that complete and consistent metadata are the lifeblood of high-quality services. They recently introduced Participation Reports (with a help page) to help members and users get a handle on completeness of CrossRef metadata collections. These reports provide completeness metrics (% of records) for 10 key metadata elements for nineteen content types over three time-periods: current (past two years and year-to-date), backfile (older), and all.
With over 12,000 members holding registered content, these reports are helpful for CrossRef members and are also a mother lode of information about how metadata collections evolve over time. Even better, the same data are available in bulk through the CrossRef API, in addition to the temporal snapshots shown in the Participation Reports.
I used the CrossRef API to download complete participation report data for a sample of 1684 members, together providing over 6400 member/content-type/time-period combinations, each termed a metadata collection. These data were retrieved using sets of parameters shown in the last row of Table 1.
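A minimal sketch of how such a retrieval might look with Python's standard library is shown here. It is not necessarily the code used for this post. It assumes the public `https://api.crossref.org/members` endpoint and that member records are returned under `message` and then `items`; the function names are illustrative only.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public CrossRef REST API members endpoint.
API_ROOT = "https://api.crossref.org/members"


def members_url(rows, offset):
    """Build the request URL for one page of member records."""
    query = urllib.parse.urlencode({"rows": rows, "offset": offset})
    return f"{API_ROOT}?{query}"


def fetch_members(rows, offset, timeout=60):
    """Download one page of member records (needs network access)."""
    with urllib.request.urlopen(members_url(rows, offset), timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumption: the records sit under message -> items in the response.
    return payload["message"]["items"]


# The two parameter sets listed in the last row of Table 1.
SAMPLE_PARAMETERS = [(1000, 0), (1000, 8001)]
SAMPLE_URLS = [members_url(rows, offset) for rows, offset in SAMPLE_PARAMETERS]
```

Calling `fetch_members(1000, 0)` would then return the first sample as a list of dictionaries, one per member.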
Each section of Table 1 shows the number of occurrences of each time-period (all, backfile, current) for each coverage-type in this sample. For example, the first sample, in columns 1-4, included book metadata for 28 members covering all time, 27 members covering the backfile, and 9 members covering current time. The second sample, in columns 5-8, included book metadata for 36 members covering all time, 16 members covering the backfile, and 31 members covering current time.
| Coverage Type | all | backfile | current | Coverage Type | all | backfile | current |
|---|---|---|---|---|---|---|---|
| book | 28 | 27 | 9 | book | 36 | 16 | 31 |
| book-chapter | 26 | 16 | 7 | book-chapter | 16 | 2 | 14 |
| book-series | 3 | | | book-series | 1 | | |
| book-set | 2 | | | | | | |
| component | 48 | | | component | 16 | | |
| dataset | 74 | 9 | 9 | dataset | 7 | 1 | 1 |
| dissertation | 2 | | | dissertation | 6 | | |
| journal | 101 | | | journal | 368 | | |
| journal-article | 843 | 838 | 683 | journal-article | 757 | 310 | 747 |
| journal-issue | 47 | 40 | 15 | journal-issue | 383 | 128 | 372 |
| journal-volume | 1 | | | journal-volume | 6 | | |
| monograph | 12 | 12 | 5 | monograph | 21 | 10 | 16 |
| other | 6 | 5 | 3 | | | | |
| posted-content | 2 | 2* | | posted-content | 3 | 2 | 3 |
| proceedings | 6 | 6 | 1 | proceedings | 24 | 1 | 24 |
| proceedings-article | 18 | 18 | 10 | proceedings-article | 33 | 6 | 33 |
| proceedings-series | 2 | 1 | 2 | | | | |
| reference-book | 3 | 3* | | reference-book | 1 | 1 | 1 |
| report | 16 | 15 | 6 | report | 18 | 8 | 16 |
| report-series | 1 | 1 | 1 | | | | |
| standard | 1 | | | | | | |
| Parameters: rows = 1000, offset = 0 | | | | Parameters: rows = 1000, offset = 8001 | | | |

\* For posted-content and reference-book in the first sample, the second count may belong to either backfile or current; it is shown under backfile here.
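Counts like those in Table 1 could be tallied from the downloaded records along the following lines. This is a sketch under assumptions: it supposes each member record carries a `coverage-type` dictionary keyed by time-period (`all`, `backfile`, `current`), each holding one entry per content type. The example records are invented purely for illustration.

```python
from collections import Counter

TIME_PERIODS = ("all", "backfile", "current")


def tally_coverage(members):
    """Count how many members report each (content type, time-period) pair."""
    counts = Counter()
    for member in members:
        # Assumption: coverage is nested as coverage-type -> period -> content type.
        coverage = member.get("coverage-type") or {}
        for period in TIME_PERIODS:
            for content_type in (coverage.get(period) or {}):
                counts[(content_type, period)] += 1
    return counts


# Hypothetical records, not real CrossRef data.
example_members = [
    {"id": 1, "coverage-type": {
        "all": {"journal-article": {}, "book": {}},
        "backfile": {"journal-article": {}},
        "current": {"journal-article": {}},
    }},
    {"id": 2, "coverage-type": {
        "all": {"journal-article": {}},
        "current": {"journal-article": {}},
    }},
]
counts = tally_coverage(example_members)
```

Each key of `counts` corresponds to one cell of the table, for example `counts[("journal-article", "backfile")]`.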
It is clear from these data that most resources in this sample are journal-articles, with hundreds of members reporting in each time-period. An analysis of metadata evolution requires samples from more than one time period, so this initial work focuses only on journal-articles and compares current and backfile metrics. This restriction yields roughly 2600 collections for analysis (838 backfile and 683 current collections in the first sample plus 310 backfile and 747 current in the second, or 2578 in total).
The member data retrieved using the API include eleven individual metrics, each with a possible range of zero to one. The names of these elements are slightly different from the field labels on the Participation Reports, and both sets of names are listed in Table 2.
| API Elements | Participation Reports | API Elements | Participation Reports |
|---|---|---|---|
| Abstracts | Abstracts | Orcids | ORCID IDs |
| Affiliations | | References | References |
| Award Numbers | Funding award numbers | Resource Links | Text mining URLs |
| Funders | Funder registry IDs | Similarity Checking | Similarity checking URLs |
| Licenses | License URLs | Update Policies | Crossmark enabled |
| Open References | Open References | | |
An unweighted sum of these numbers yields a single completeness index for each time-period, with a range between zero and eleven. This index was calculated for all of the collections in my sample. The largest observed index is 8.76, about 80% of the largest possible value (11).
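The unweighted sum can be sketched in a few lines of Python. The key names are assumptions: hyphenated lowercase versions of the API element names in Table 2, which may not match the API response exactly. Missing elements are treated as zero and any other keys in the record are ignored.

```python
# Assumed key names, derived from the API element names in Table 2.
ELEMENTS = (
    "abstracts", "affiliations", "award-numbers", "funders", "licenses",
    "open-references", "orcids", "references", "resource-links",
    "similarity-checking", "update-policies",
)


def completeness_index(metrics):
    """Unweighted sum of the eleven completeness fractions (range 0 to 11)."""
    return sum(float(metrics.get(name, 0.0)) for name in ELEMENTS)


# A collection with every element in every record reaches the maximum.
full_index = completeness_index({name: 1.0 for name in ELEMENTS})

# A hypothetical collection with only three elements partly populated.
partial_index = completeness_index(
    {"abstracts": 0.5, "licenses": 1.0, "references": 0.25}
)
```

Here `full_index` is 11.0 and `partial_index` is 1.75.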
Subtracting the backfile index from the current index provides a quick comparison of the current and backfile time periods. This yields a set of change metrics with values between -5.57 and 8.00. The extremes of this range provide examples that help understand how to visualize and interpret these numbers.
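As a small sketch, the change metric is a single subtraction. The handling of a missing time-period is an assumption here: a period with no data is counted as an index of zero, which is what lets members with data in only one period sit at the extremes of the range.

```python
def change_metric(current_index, backfile_index):
    """Current index minus backfile index; a missing period (None) counts as 0."""
    current = 0.0 if current_index is None else current_index
    backfile = 0.0 if backfile_index is None else backfile_index
    return current - backfile


# A member with only current data and an index of 8.0.
new_member_change = change_metric(8.0, None)

# A member with only backfile data and an index of 5.57.
lapsed_member_change = change_metric(None, 5.57)
```

With these inputs `new_member_change` is 8.0 and `lapsed_member_change` is -5.57, matching the two ends of the observed range.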
The largest value of the change metric (8.0) is observed for the University of Chitral, a Pakistani university dedicated to serving society by “developing in students a delicate intellectual, cultural, ethical, and humane sensitivity through a pleasant blend of ancient and modern wisdom”. We use a radar plot to visualize and compare the metrics for the backfile and current time-periods (Figure 1). The eleven metadata elements are arranged around the radar plot, with the % of records that include each element shown along the axes of the plot, with zero at the center and the maximum value, one in this case, at the outer edge of the plot.
The University of Chitral is a relatively new member of CrossRef with twenty-nine journal articles registered in 2017 and 2018, so they have no backfile data. In this case, the radar plot has data only for the current time-period. All eight of the elements included in the Participation Report are in 100% of the records, so all of the current data are around the outside edge of the plot and the current index is 8.0. This is an impressive starting point for this CrossRef member.
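The geometry behind such a radar plot can be sketched with the standard library alone. The function below is illustrative, not the plotting code used for the figures: it places each value on its own evenly spaced axis, with zero at the center and one at the outer edge, and returns a closed polygon that a plotting library could then draw.

```python
import math


def radar_vertices(values, radius=1.0):
    """Return closed-polygon (x, y) vertices for values on evenly spaced axes."""
    if not values:
        return []
    n = len(values)
    points = []
    for i, value in enumerate(values):
        angle = 2.0 * math.pi * i / n  # one axis per metadata element
        points.append((radius * value * math.cos(angle),
                       radius * value * math.sin(angle)))
    points.append(points[0])  # repeat the first vertex to close the outline
    return points


# Eleven elements all at 100%: every vertex lies on the outer edge.
full_outline = radar_vertices([1.0] * 11)

# No data for a period: the outline collapses to the center.
empty_outline = radar_vertices([0.0] * 11)
```

A period with no data therefore draws nothing visible, which is why a member with a single populated period shows only one color.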
A longer-term member will include metadata from the backfile as well as the current time-period and so provides information about the evolution of the collection. The largest overall metric in this group is from the Rockefeller University Press, which includes 57,526 registered items in the backfile and 1,529 items in the current period. The detailed metrics for the current and backfile time-periods and the differences are shown in Table 3. The smallest differences are for Open References, Similarity Checking, and Update Policies, which are all included in ~100% of the records during each time period. The largest difference is for Licenses, which increased from 0.01 to 0.99. The overall metric increased from 3.40 to 8.76, an increase of 5.36, the largest increase observed in this group.
| Time-period | Total | Abstracts | Affiliations | Award Numbers | Funders | Licenses | Open References | Orcids | References | Resource Links | Similarity Checking | Update Policies |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| current | 8.76 | 0.94 | 0.25 | 0.79 | 0.84 | 0.99 | 1.00 | 0.95 | 0.97 | 0.05 | 1.00 | 1.00 |
| backfile | 3.40 | 0.05 | 0.00 | 0.01 | 0.01 | 0.01 | 1.00 | 0.00 | 0.33 | 0.00 | 1.00 | 0.99 |
| change | 5.36 | 0.89 | 0.25 | 0.78 | 0.83 | 0.98 | 0.00 | 0.94 | 0.63 | 0.05 | 0.00 | 0.01 |
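As a worked check, the totals in Table 3 can be recomputed from the rounded per-element values. Doing so gives 8.78 rather than 8.76 for the current period, and differences of 0.95 and 0.64 rather than 0.94 and 0.63 for Orcids and References. Presumably the published figures were computed from unrounded values, although that is an assumption on my part.

```python
ELEMENTS = [
    "Abstracts", "Affiliations", "Award Numbers", "Funders", "Licenses",
    "Open References", "Orcids", "References", "Resource Links",
    "Similarity Checking", "Update Policies",
]

# Rounded values copied from the current and backfile rows of Table 3.
current = [0.94, 0.25, 0.79, 0.84, 0.99, 1.00, 0.95, 0.97, 0.05, 1.00, 1.00]
backfile = [0.05, 0.00, 0.01, 0.01, 0.01, 1.00, 0.00, 0.33, 0.00, 1.00, 0.99]

current_index = round(sum(current), 2)    # 8.78
backfile_index = round(sum(backfile), 2)  # 3.4
change_index = round(current_index - backfile_index, 2)  # 5.38

# Per-element change, keyed by element name.
change = {
    name: round(c - b, 2)
    for name, c, b in zip(ELEMENTS, current, backfile)
}
largest = max(change, key=change.get)  # "Licenses"
```

The differences of about 0.02 from the table do not affect the ranking of elements: Licenses still shows the largest change.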
Figure 2 shows the backfile (orange) and current (blue) metrics for Rockefeller University Press. The title gives the member name, the content-type, the metrics for the current and backfile periods, and the difference, i.e. (current - backfile = change). The large increases in completeness for Abstracts, Award Numbers, Funders, Licenses, Orcids, and References seen in Table 3 are also apparent in this figure, as is the similar coverage of Update Policies, Similarity Checking, and Open References in the two periods.
At the other end of the spectrum, members with large negative change metrics are those that have data during the backfile but not during the current period. The largest negative change is observed for Bayward Publishing Company, Inc. Figure 3 shows the radar plot for this member. In this case only the backfile has data so the plot shows just the orange data. This member has some very complete metadata during the backfile with four elements (Update Policies, Similarity Checking, Resource Links, and Licenses) included in all records and Affiliations and References included in over 70% of the records. It is interesting to note that the completeness in this collection occurs in generally different quadrants of the radar plot than in the other cases. At this point we do not know if this is a general pattern.
Conclusion – The CrossRef Participation Reports provide an opportunity for exploring snapshots of completeness for member metadata collections during several time periods. Data available using the CrossRef API make it possible to capture participation data for many members and to explore those data to identify patterns of metadata evolution. I suggest a simple set of metrics for describing and comparing metadata collections and show several end-member cases that demonstrate how those metrics can be applied to identify common patterns in evolution. Future posts will explore these data in more detail in order to quantitatively identify commonalities with the goal of understanding how CrossRef members are motivated to improve their metadata.