Metadata Game Changers

Blog

Exploring metadata, communities, and new ideas.

Metadata Evolution - CrossRef Participation Reports

February 20, 2019 / Ted Habermann

Over the last several years, CrossRef has grown into one of the largest and most significant metadata repositories in the world. It contains over 100,000,000 registered content items and offers many services that help members and other users take advantage of that content. In addition to managing registered content, CrossRef makes links between resources; it is the connective tissue of the web of scholarly communications.

Of course, CrossRef knows that complete and consistent metadata are the lifeblood of high-quality services. They recently introduced Participation Reports (with a help page) to help members and users get a handle on the completeness of CrossRef metadata collections. These reports provide completeness metrics (% of records) for ten key metadata elements across nineteen content types over three time periods: current (the past two years and year-to-date), backfile (older), and all.

With over 12,000 members registering content, these reports are helpful for individual CrossRef members and a mother lode of information about how metadata collections evolve over time. Even better, this mother lode is available in bulk through the CrossRef API, as well as through the temporal snapshots created by the Participation Reports.

I used the CrossRef API to download complete participation report data for a sample of 1684 members, together providing over 6400 member/content-type/time-period combinations, each termed a metadata collection. These data were retrieved using sets of parameters shown in the last row of Table 1.
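The retrieval step can be sketched as follows. This is a minimal sketch, assuming the public CrossRef `/members` route with `rows` and `offset` paging parameters (the parameter sets shown in Table 1); the exact location of the coverage data inside each member record may differ from what this sketch assumes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.crossref.org/members"

def members_url(rows=1000, offset=0):
    """Build a paged query URL using the parameter sets shown in Table 1."""
    return f"{API_BASE}?{urlencode({'rows': rows, 'offset': offset})}"

def fetch_members(rows=1000, offset=0):
    """Download one page of member records (network call; not run here)."""
    with urlopen(members_url(rows, offset)) as resp:
        return json.load(resp)["message"]["items"]

# The two samples analyzed in this post would come from:
sample_urls = [members_url(1000, 0), members_url(1000, 8001)]
```

Each page returns up to 1000 member records; iterating over `offset` values walks the full membership.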

Each section of Table 1 shows the number of occurrences of each time period (all, backfile, current) for each content type in this sample. For example, the first sample included book metadata for 28 members covering all time, 27 members covering the backfile, and 9 members covering the current period. The second sample included book metadata for 36 members covering all time, 16 members covering the backfile, and 31 members covering the current period.

Table 1. Coverage types, time periods, and collection counts.
Sample 1: rows = 1000, offset = 0; Sample 2: rows = 1000, offset = 8001.

                            Sample 1                   Sample 2
Coverage Type        all  backfile  current     all  backfile  current
book                  28        27        9      36        16       31
book-chapter          26        16        7      16         2       14
book-series            3         -        -       1         -        -
book-set               2         -        -       -         -        -
component             48         -        -      16         -        -
dataset               74         9        9       7         1        1
dissertation           2         -        -       6         -        -
journal              101         -        -     368         -        -
journal-article      843       838      683     757       310      747
journal-issue         47        40       15     383       128      372
journal-volume         1         -        -       6         -        -
monograph             12        12        5      21        10       16
other                  6         5        3       -         -        -
posted-content         2         2        -       3         2        3
proceedings            6         6        1      24         1       24
proceedings-article   18        18       10      33         6       33
proceedings-series     -         -        -       2         1        2
reference-book         3         3        -       1         1        1
report                16        15        6      18         8       16
report-series          -         -        -       1         1        1
standard               -         -        -       1         -        -

It is clear from these data that most resources in this sample are journal-articles, with hundreds of members reporting in each time-period. An analysis of metadata evolution requires samples from more than one time period, so this initial work focuses only on journal-articles and compares current and backfile metrics. This restriction yields more than 2600 collections for analysis.
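The restriction described above (journal-article collections with metrics in both the current and backfile periods) can be sketched as a simple filter. The flat record layout used here is hypothetical, purely for illustration; real API responses nest coverage data per member.

```python
# Keep only members reporting journal-article metrics for BOTH the
# current and backfile time periods, returning (current, backfile) pairs.
def journal_article_pairs(collections):
    by_member = {}
    for rec in collections:
        if rec["content_type"] != "journal-article":
            continue
        by_member.setdefault(rec["member"], {})[rec["period"]] = rec["metrics"]
    return {m: (p["current"], p["backfile"])
            for m, p in by_member.items()
            if "current" in p and "backfile" in p}

# Toy records (illustrative, not real Participation Report data):
sample = [
    {"member": "A", "content_type": "journal-article", "period": "current",
     "metrics": {"Abstracts": 0.9}},
    {"member": "A", "content_type": "journal-article", "period": "backfile",
     "metrics": {"Abstracts": 0.1}},
    {"member": "B", "content_type": "journal-article", "period": "current",
     "metrics": {"Abstracts": 1.0}},
]
pairs = journal_article_pairs(sample)  # only member "A" has both periods
```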

The member data retrieved using the API include eleven individual metrics, each with a possible range of zero to one. The names of these elements differ slightly from the field labels on the Participation Reports; both sets of names are listed in Table 2.

Table 2. Metadata item names in the CrossRef API and Participation Reports

API Element           Participation Report Label
Abstracts             Abstracts
Affiliations          -
Award Numbers         Funding award numbers
Funders               Funder registry IDs
Licenses              License URLs
Open References       Open References
Orcids                ORCID IDs
References            References
Resource Links        Text mining URLs
Similarity Checking   Similarity checking URLs
Update Policies       Crossmark enabled

An unweighted sum of these numbers yields a single metric for each time-period with a range between zero and eleven. This metric was calculated for all of the collections in my sample. The largest observed index is 8.76, 80% of the largest possible value (11).

Subtracting the backfile index from the current index provides a quick comparison of the current and backfile time periods. This yields a set of change metrics with values between -5.57 and 8.00. The extremes of this range provide examples that help understand how to visualize and interpret these numbers.
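The index and change metric described above amount to a few lines of code. This sketch uses the API element names from Table 2; missing elements count as zero, so the index ranges from 0 to 11 and the change metric from -11 to 11.

```python
# Unweighted completeness index and change metric, as described above.
ELEMENTS = [
    "Abstracts", "Affiliations", "Award Numbers", "Funders", "Licenses",
    "Open References", "Orcids", "References", "Resource Links",
    "Similarity Checking", "Update Policies",
]

def completeness_index(metrics):
    """Unweighted sum of the eleven per-element fractions (0..11)."""
    return sum(metrics.get(name, 0.0) for name in ELEMENTS)

def change_metric(current, backfile):
    """Current index minus backfile index; negative when coverage drops."""
    return completeness_index(current) - completeness_index(backfile)

# A member complete in every element scores the maximum, 11:
perfect = {name: 1.0 for name in ELEMENTS}
empty = {}
```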

The largest value of the change metric (8.0) is observed for the University of Chitral, a Pakistani university dedicated to serving society by “developing in students a delicate intellectual, cultural, ethical, and humane sensitivity through a pleasant blend of ancient and modern wisdom”. We use a radar plot to visualize and compare the metrics for the backfile and current time periods (Figure 1). The eleven metadata elements are arranged around the radar plot, with the % of records that include each element shown along the radial axis: zero at the center and the maximum value (one in this case) at the outer edge of the plot.

Figure 1. Data for the University of Chitral - This is the CrossRef member in my sample with the largest difference between the backfile and current time periods. Note that there are no data during the backfile period and that current content is complete (on the outside edge of the circle) for all elements that have content.

The University of Chitral is a relatively new member of CrossRef with twenty-nine journal articles registered in 2017 and 2018, so they have no backfile data. In this case, the radar plot has data only for the current time-period. All eight of the elements included in the Participation Report are in 100% of the records, so all of the current data are around the outside edge of the plot and the current index is 8.0. This is an impressive starting point for this CrossRef member.
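A radar plot like those in Figures 1-3 can be drawn with matplotlib's polar axes. This is a sketch, not the code used for the figures in this post, and the element values below are illustrative only (eight complete elements, three empty, echoing the University of Chitral case).

```python
# Sketch of a radar (polar) plot for the eleven completeness metrics.
import math
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

elements = ["Abstracts", "Affiliations", "Award Numbers", "Funders",
            "Licenses", "Open References", "Orcids", "References",
            "Resource Links", "Similarity Checking", "Update Policies"]
current = [1.0] * 8 + [0.0] * 3  # illustrative values, not real data

# One spoke per element, evenly spaced around the circle.
angles = [2 * math.pi * i / len(elements) for i in range(len(elements))]

ax = plt.subplot(projection="polar")
# Close the polygon by repeating the first point at the end.
ax.plot(angles + angles[:1], current + current[:1], color="tab:blue")
ax.set_xticks(angles)
ax.set_xticklabels(elements, fontsize=7)
ax.set_ylim(0, 1)  # zero at the center, complete coverage at the edge
plt.savefig("radar.png", dpi=150)
```

Plotting the backfile series in a second color on the same axes gives the two-period comparison shown in Figure 2.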


A longer-term member will include metadata from the backfile as well as the current time-period, providing information about the evolution of the collection. The largest change in this group is for Rockefeller University Press, which includes 57,526 registered items in the backfile and 1529 items in the current period. The detailed metrics for the current and backfile time-periods, and the differences between them, are shown in Table 3. The smallest differences are for Open References, Similarity Checking, and Update Policies, which are all included in ~100% of the records during each time period. The largest difference is for Licenses, which increased from 0.01 to 0.99. The overall metric increased from 3.40 to 8.76, an increase of 5.36, the largest observed among members with data in both time periods.

Table 3. Rockefeller University Press metrics for the eleven metadata items in the current and backfile time periods, and the change (current - backfile).

Metadata Item         current  backfile  change
Total                    8.76      3.40    5.36
Abstracts                0.94      0.05    0.89
Affiliations             0.25      0.00    0.25
Award Numbers            0.79      0.01    0.78
Funders                  0.84      0.01    0.83
Licenses                 0.99      0.01    0.98
Open References          1.00      1.00    0.00
Orcids                   0.95      0.00    0.94
References               0.97      0.33    0.63
Resource Links           0.05      0.00    0.05
Similarity Checking      1.00      1.00    0.00
Update Policies          1.00      0.99    0.01

Figure 2 shows the backfile (orange) and current (blue) metrics for Rockefeller University Press. The title gives the member name, the content type, the metrics for the current and backfile periods, and the difference (current - backfile = change). The large increases in completeness for Abstracts, Award Numbers, Funders, Licenses, Orcids, and References seen in Table 3 are also apparent in this figure, as are the similar coverages for Update Policies, Similarity Checking, and Open References in the two periods.

Figure 2. Rockefeller University Press shows the largest increase in completeness among members in my sample with data in both the backfile and current time periods. Note that content was introduced for many elements and continued at a high level of completeness for others.

At the other end of the spectrum, members with large negative change metrics are those that have data during the backfile but not during the current period. The largest negative change is observed for Bayward Publishing Company, Inc. Figure 3 shows the radar plot for this member. In this case only the backfile has data, so the plot shows just the orange curve. This member has very complete metadata during the backfile, with four elements (Update Policies, Similarity Checking, Resource Links, and Licenses) included in all records and Affiliations and References included in over 70% of the records. It is interesting that the completeness in this collection occurs in generally different quadrants of the radar plot than in the other cases; at this point we do not know whether this is a general pattern.

Figure 3. Bayward Publishing has very complete content during the backfile time period, but no data during the current time period. This results in a negative change of -5.57.

Conclusion – The CrossRef Participation Reports provide an opportunity to explore snapshots of completeness for member metadata collections during several time periods. Data available through the CrossRef API make it possible to capture participation data for many members and to explore those data for patterns of metadata evolution. I suggest a simple set of metrics for describing and comparing metadata collections and show several end-member cases that demonstrate how those metrics can be applied to identify common evolution patterns. Future posts will explore these data in more detail to quantitatively identify commonalities, with the goal of understanding how CrossRef members are motivated to improve their metadata.

Tags: Crossref Metadata, Metadata Evaluation
