Other editions of this work: An experiment with OCLC's LOD work identifiers

Large library collections, and even more so portals or discovery systems that aggregate data from diverse sources, face the problem of duplicate content. Wouldn't it be nice if every edition of a work could be collected behind one entry in a result set?

The WorldCat catalogue, provided by OCLC, holds more than 320 million bibliographic records. Since early 2014, OCLC has shared its 197 million work descriptions as Linked Open Data: "A Work is a high-level description of a resource, containing information such as author, name, descriptions, subjects etc., common to all editions of the work. ... In the case of a WorldCat Work description, it also contains [Linked Data] links to individual, oclc numbered, editions already shared in WorldCat." The works and editions are marked up with schema.org semantic markup, in particular using schema:exampleOfWork/schema:workExample for the relation from edition to work and vice versa. These properties were added to the schema.org spec recently, as suggested by the W3C Schema Bib Extend Community Group.

ZBW contributes to WorldCat, and has 1.2 million oclc numbers attached to its bibliographic records. So it seemed interesting to find out how many of these editions link to works, and furthermore to other editions of the very same work.

As a basis for our experiment, we extracted the "oclc subset" from EconBiz. Roughly one third of the instances in this subset are in English (and therefore most likely to be included in WorldCat from other sources as well), one third are in other languages (mostly German), and for one third the language is unknown. We randomly selected 100,000 instances from the 1.2 million "oclc subset". For each of the attached oclc numbers we looked up the work id, and in turn the editions of this work.

An example for the work/edition linking

We start with oclc number 247780068, which represents an edition of the book "Changes in the structure of employment with economic development" by Amarjit S. Oberai (WorldCat, EconBiz).

A lookup via

curl -LH "Accept: application/ld+json" http://www.worldcat.org/oclc/247780068

returns data in JSON-LD format. Here is a heavily shortened version (the full data of the example is available on GitHub):

{
    "exampleOfWork" : "http://worldcat.org/entity/work/id/14820164",
    "schema:name" : "Changes in the structure of employment with economic development",
    "schema:datePublished" : "1978",
    "@id" : "http://www.worldcat.org/oclc/247780068",
    "workExample" : [
      "http://worldcat.org/isbn/9789221019268",
      "http://worldcat.org/isbn/9789221027737"
    ]
}

We go up the hierarchy by picking the "exampleOfWork" URI:

curl -LH "Accept: application/ld+json" http://worldcat.org/entity/work/id/14820164

and get the data for the work (heavily shortened again):

{
    "schema:name" : [
      "Changes in the structure of employment with economic development",
      {
          "@value" : "Changes in the structure of employment with economic development",
          "@language" : "en"
      },
      {
          "@value" : "Changes in the structure of employment with economic development /",
          "@language" : "en"
      },
      "Changes in the structure of employment with economic development /",
      "Changes in the structure of employment with economic development."
    ],
    "@id" : "http://worldcat.org/entity/work/id/14820164",
    "workExample" : [
      "http://www.worldcat.org/oclc/609785684",
      "http://www.worldcat.org/oclc/784152303",
      "http://www.worldcat.org/oclc/797167669",
      "http://www.worldcat.org/oclc/4563017",
      "http://www.worldcat.org/oclc/245743486",
      "http://www.worldcat.org/oclc/730127832",
      "http://www.worldcat.org/oclc/247780068",
      "http://www.worldcat.org/oclc/716102743",
      "http://www.worldcat.org/oclc/732345392",
      "http://www.worldcat.org/oclc/466428327",
      "http://www.worldcat.org/oclc/716519725",
      "http://www.worldcat.org/oclc/472872207",
      "http://www.worldcat.org/oclc/760439428",
      "http://www.worldcat.org/oclc/254548387",
      "http://www.worldcat.org/oclc/705976718",
      "http://www.worldcat.org/oclc/781155896",
      "http://www.worldcat.org/oclc/8215485",
      "http://www.worldcat.org/oclc/803773951"
    ]
}

As we can observe, different forms of the title of the work (with and without language tag) are collected in the "schema:name" property; WorldCat does not try to determine an authoritative title for the work. Data from different editions is likewise collected for authors and subjects. Sometimes literal values are complemented by URIs, for example to VIAF (persons) or LCSH (subjects).

We now can look up other editions of this work, given in a "workExample" property, e.g.

curl -LH "Accept: application/ld+json" http://www.worldcat.org/oclc/245743486

which reveals a later edition of the same work:

{
    "exampleOfWork" : "http://worldcat.org/entity/work/id/14820164",
    "schema:name" : "Changes in the structure of employment with economic development",
    "schema:datePublished" : "1981",
    "@id" : "http://www.worldcat.org/oclc/245743486",
    "schema:bookEdition" : "2nd ed",
    "workExample" : "http://worldcat.org/isbn/9789221027737"
}

WorldCat itself seems to use this data in its "View all editions and formats" link on the human-readable web pages for the editions. The ISBN-based "workExample" links on the edition level redirect to oclc numbers; their purpose and use do not seem to be documented yet.
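The walk from an edition up to its work and down to the sibling editions can be sketched in a few lines of Python. This is only an illustration that operates on the heavily shortened JSON shown above; the full WorldCat responses are structured differently, and the function names (as_list, work_uri, edition_numbers, other_editions) are made up for this sketch.

```python
import json
import re

# Shortened documents in the shape shown in the article (assumption:
# the real responses are larger and nested differently).
EDITION = json.loads("""
{
  "exampleOfWork": "http://worldcat.org/entity/work/id/14820164",
  "@id": "http://www.worldcat.org/oclc/247780068",
  "workExample": ["http://worldcat.org/isbn/9789221019268"]
}
""")

WORK = json.loads("""
{
  "@id": "http://worldcat.org/entity/work/id/14820164",
  "workExample": [
    "http://www.worldcat.org/oclc/609785684",
    "http://www.worldcat.org/oclc/247780068",
    "http://www.worldcat.org/oclc/245743486"
  ]
}
""")

OCLC_URI = re.compile(r"^http://www\.worldcat\.org/oclc/(\d+)$")


def as_list(value):
    """A JSON-LD property may hold one value or a list; always return a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def work_uri(edition):
    """Return the work URI an edition points to, or None if there is none."""
    uris = as_list(edition.get("exampleOfWork"))
    return uris[0] if uris else None


def edition_numbers(work):
    """Return the oclc numbers found in a work's workExample property."""
    numbers = []
    for uri in as_list(work.get("workExample")):
        match = OCLC_URI.match(uri) if isinstance(uri, str) else None
        if match:  # anything that is not an oclc URI (e.g. ISBN links) is skipped
            numbers.append(match.group(1))
    return numbers


def other_editions(edition, work):
    """Return the oclc numbers of all editions of the work except the given one."""
    own = OCLC_URI.match(edition["@id"]).group(1)
    return [n for n in edition_numbers(work) if n != own]


print(work_uri(EDITION))
print(other_editions(EDITION, WORK))
```

The as_list helper is there because the examples above show "workExample" both as a list and as a single string.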

The experiment

The starting point for the experiment outlined above was the question to what extent such work/edition links exist for a real-world collection like EconBiz. For the randomly selected 100,000 instances (editions) with oclc numbers, we looked up the edition by its URI and extracted the related work (if such a work exists). We then looked up the work and extracted the oclc numbers of all its editions. For each oclc number of the starting set, we saved a list of oclc numbers of other editions as a JSON data structure. This took about 44 hours of runtime. (We deliberately didn't parallelize the network access, to avoid overloading the server, and cached results to save some lookups.) For 15 of the lookups we got a 500 "internal server error", for 71 a 404 "not found". These errors occurred on edition as well as on work lookups, with no recognizable pattern. Some random tests revealed that a second lookup of the URL normally succeeded. Due to their small number, we ignored these errors in the further course of the experiment.
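The procedure described above (sequential lookups, a cache for works already seen, and counting failed lookups instead of retrying them) might look roughly like the following sketch. It is not the code from the GitHub repository. The fetch parameter is a hypothetical callable that returns a (status, document) pair for a URI, and the documents are assumed to have the shortened shape used in the example section.

```python
from collections import Counter


def collect_other_editions(oclc_numbers, fetch):
    """Map each starting oclc number to the oclc numbers of its other editions.

    `fetch(uri)` is a hypothetical callable returning (http_status, document).
    Passing it in keeps the sketch independent of any particular HTTP library.
    """
    cache = {}          # work URI -> oclc numbers of all its editions
    errors = Counter()  # HTTP status -> number of failed lookups
    result = {}

    for number in oclc_numbers:  # strictly one request at a time
        status, edition = fetch(f"http://www.worldcat.org/oclc/{number}")
        if status != 200:
            errors[status] += 1
            continue
        work = edition.get("exampleOfWork")
        if work is None:         # edition without a work
            result[number] = []
            continue
        if work not in cache:    # look up each work only once
            status, doc = fetch(work)
            if status != 200:
                errors[status] += 1
                continue
            examples = doc.get("workExample", [])
            if isinstance(examples, str):
                examples = [examples]
            cache[work] = [uri.rsplit("/", 1)[-1] for uri in examples]
        result[number] = [n for n in cache[work] if n != number]
    return result, errors


# Tiny in-memory stand-in for the server, for illustration only.
WORK_URI = "http://worldcat.org/entity/work/id/9"
FAKE = {
    "http://www.worldcat.org/oclc/1": (200, {"exampleOfWork": WORK_URI}),
    "http://www.worldcat.org/oclc/2": (200, {"exampleOfWork": WORK_URI}),
    WORK_URI: (200, {"workExample": ["http://www.worldcat.org/oclc/1",
                                     "http://www.worldcat.org/oclc/2"]}),
}
calls = []


def fake_fetch(uri):
    calls.append(uri)
    return FAKE.get(uri, (404, None))


result, errors = collect_other_editions(["1", "2", "3"], fake_fetch)
print(result, dict(errors), calls.count(WORK_URI))
```

In the toy run, the work is fetched only once although two editions point to it, and the unknown oclc number 3 is counted as a 404 and skipped.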

In a second step, we evaluated the resulting data with respect to the whole of WorldCat, and with respect to the data of our collection. All code and data for the experiment (and for a prior run seven weeks earlier, which did not show significantly different results) are available on GitHub.

The results

For more than 99 % of our test data set we found valid WorldCat work ids. For a total of 880 oclc numbers we couldn't retrieve a work; in 922 cases the work did not link back to the oclc number from which we started. So in this early stage of WorldCat Linked Data (still flagged as "experimental") there seem to be some minor gaps and inconsistencies in the work/edition linkage. Yet the results show that more than 60 % of the editions in our test set link to a work with at least one other edition within WorldCat.

Number of editions per work | re. all OCLC numbers from WorldCat | re. 1,260,337 OCLC numbers from EconBiz
1                           | 37847                              | 92879
2                           | 13734                              | 4826
3-5                         | 23697                              | 1225
6-10                        | 14663                              | 137
11-50                       | 8563                               | 50
51-100                      | 443                                | 2
101-9999                    | 173                                | 1

When we take into account which of these other editions are in the holdings of EconBiz, the number boils down to 6.2 % of the test set.
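The percentages quoted in this section can be recomputed from the table. The short sketch below assumes that both columns describe the same test set of 100,000 editions, once counting editions anywhere in WorldCat and once counting only editions held by EconBiz. That reading is an interpretation, though it is supported by both columns adding up to the same total.

```python
# Counts copied from the table; keys are the "editions per work" buckets.
worldcat = {"1": 37847, "2": 13734, "3-5": 23697, "6-10": 14663,
            "11-50": 8563, "51-100": 443, "101-9999": 173}
econbiz = {"1": 92879, "2": 4826, "3-5": 1225, "6-10": 137,
           "11-50": 50, "51-100": 2, "101-9999": 1}

SAMPLE = 100_000  # size of the random test set

with_work = sum(worldcat.values())   # editions for which a work was found
without_work = SAMPLE - with_work    # editions for which no work was retrieved

# Editions whose work has at least one other edition.
multi_worldcat = with_work - worldcat["1"]
multi_econbiz = sum(econbiz.values()) - econbiz["1"]

print(f"work found:               {100 * with_work / SAMPLE:.2f} %")
print(f"no work retrieved:        {without_work}")
print(f"other edition in WorldCat: {100 * multi_worldcat / SAMPLE:.1f} % of the test set")
print(f"other edition in EconBiz:  {100 * multi_econbiz / SAMPLE:.1f} % of the test set")
```

Under that reading, the 880 oclc numbers without a work are exactly the difference between the test set and the column totals, and the 6.2 % is the share of the test set that falls in the EconBiz rows for two or more editions.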

The resulting edition clusters themselves, the clustering algorithms, and in the end the cataloging practices they result from, require further analysis and discussion. A quick glance at the largest clusters in EconBiz reveals that they result from serials: Indian village surveys, country profiles or economic analyses for different countries. Whether these clusters in particular make sense to users seems questionable.

How could this be useful?

One aim in a larger subject portal like EconBiz, which merges several data sources, is to reduce the duplicates in the result sets that users receive. Unfortunately, only a minor part (1.2 of 8 million records) of the EconBiz holdings have oclc numbers, and only a fraction of these form clusters within these holdings. So currently the WorldCat work clusters could only be a tiny piece of the de-duplication puzzle. For the development of custom de-duplication algorithms, however, the data may provide a starting point: firstly as a pool of possible example cases, and secondly as a counterpart for statistical analysis of results. (In a recent blog entry with some answers to early questions about the OCLC work entities, Richard Wallis points to OCLC's FRBR Work-Set Algorithm, which has been described in a 2002 D-Lib magazine article.) Some random samples revealed a situation where de-duplication can be highly helpful even for a few instances: when working papers or other sources have records both with and without attached links to the full text, work clusters could be exploited to always display a link to the PDF when an instance/edition is presented to the users.
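The full-text idea could be sketched as follows. The record layout (keys "id", "work" and "pdf") is purely hypothetical and does not reflect the EconBiz data model. The sketch also glosses over the fact that different editions of a work can differ in content, so a borrowed link would probably need to be labeled as such.

```python
def propagate_fulltext(records):
    """Give every record in a work cluster a full-text link if any member has one.

    `records` is a list of dicts with the hypothetical keys
    'id', 'work' (work id or None) and 'pdf' (URL or None).
    """
    # First pass: remember one PDF link per work (the first one seen).
    pdf_by_work = {}
    for rec in records:
        if rec.get("work") and rec.get("pdf"):
            pdf_by_work.setdefault(rec["work"], rec["pdf"])

    # Second pass: keep a record's own link, else borrow the cluster's link.
    enriched = []
    for rec in records:
        link = rec.get("pdf") or pdf_by_work.get(rec.get("work"))
        enriched.append({**rec, "pdf": link, "pdf_borrowed": link != rec.get("pdf")})
    return enriched


records = [
    {"id": "a", "work": "w1", "pdf": None},
    {"id": "b", "work": "w1", "pdf": "http://example.org/paper.pdf"},
    {"id": "c", "work": None, "pdf": None},
]
for rec in propagate_fulltext(records):
    print(rec["id"], rec["pdf"], rec["pdf_borrowed"])
```

The pdf_borrowed flag is one way to let the presentation layer distinguish a record's own link from one taken over from a sibling edition.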

Another area where work clusters could be useful immediately is the ranking of search results. If we suppose that works for which multiple instances exist are more relevant, we can use that as a ranking factor (surely among others). Since it makes no crucial difference where these editions exist, we can base such an assumption on the whole of WorldCat, and thus add such a ranking factor for a much larger part of our existing data.
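As a rough illustration of such a ranking factor, the sketch below multiplies a text relevance score by a logarithmic boost based on the number of editions of the work in WorldCat. Both the formula and the weight parameter are assumptions made for this example, not something tested in the experiment. The logarithm is one way to keep the very large serial clusters from dominating.

```python
import math


def boosted_score(text_score, edition_count, weight=0.1):
    """Combine a text relevance score with a boost for multi-edition works.

    A work with a single edition (or an unknown count) gets no boost,
    because log(1) is 0. `weight` is a hypothetical tuning parameter.
    """
    return text_score * (1.0 + weight * math.log(max(edition_count, 1)))


# Hypothetical hits: (record id, text score, editions of the work in WorldCat)
hits = [("a", 2.0, 1), ("b", 1.9, 18), ("c", 2.0, 2)]
ranked = sorted(hits, key=lambda h: boosted_score(h[1], h[2]), reverse=True)
print([h[0] for h in ranked])
```

With the default weight, a slightly lower text score combined with many editions can overtake a single-edition record.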

This does not even touch the most exciting field for exploitation: the descriptions on the work as well as on the edition level. For subject indexing and classification, this has been investigated by Magnus Pfeffer (slides) and Kai Eckert, e.g. in the UB Mannheim Linked Data Service, and continued in the Culturegraph project. Possible applications are the collection and merging of index terms or classes from different editions of a work, or perhaps also an evaluation of indexing consistency. Heidrun Wiesenmüller suggested the use of work clusters for the enrichment with personal name authority data, or even the enrichment of the authority itself (slides, in German).

OCLC has announced further development of the service: "WorldCat Works will continue to be enhanced over the coming months and years. The data will get cleaner, the descriptions will get richer, and the linking will get better."

----

With thanks to Kirsten Jeude, Kim Plassmeier and Timo Borst for hints and discussions.