skos-history: New method for change tracking applied to STW Thesaurus for Economics

“What’s new?” and “What has changed?” are the questions users of Knowledge Organization Systems (KOS), such as thesauri or classifications, ask when a new version is published. This is all the more true when a thesaurus that has existed since the 1990s has been completely revised, subject area by subject area. After four intermediate versions published in as many consecutive years, ZBW's STW Thesaurus for Economics has recently been relaunched as version 9.0. In total, 777 descriptors have been added; 1,052 (of about 6,000) have been deprecated, the vast majority of them merged into other descriptors. More subtle changes include modified preferred labels, and merges and splits of existing concepts.

Since STW was first published on the web in 2009, we have gone to great lengths to make changes traceable: no concept and no web page has been deleted, and everything from prior versions is still available. Following a presentation at DC-2013 in Lisbon, I started the skos-history project, which aims to exploit published SKOS files of different versions for change tracking. A first beta implementation of Linked-Data-based change reports went live with STW 8.14, making use of SPARQL "live queries" (as described in a prior post). With the publication of STW 9.0, full reports of the changes are available. How do they work?


The basic idea is to exploit the power of SPARQL on named graphs of different versions of the thesaurus. Once these versions are loaded into a "version store", we can compute deltas (version differences) and save them as named graphs, too. A combination of the dataset versioning ontology (dsv:) by Johan De Smedt, the skos-history ontology (sh:), SPARQL service description (sd:) and VoID (void:) provides the necessary plumbing in a separate version history graph:

[Figure: skos-history example graphs]
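To make the delta idea concrete, here is a minimal sketch in plain Python rather than SPARQL. It assumes that a named graph can be modeled as a set of (subject, predicate, object) tuples; the graph names, the `ex:` identifiers and the function `compute_delta` are invented for illustration and are not part of skos-history.

```python
from typing import Dict, Set, Tuple

# Assumption: a triple is a (subject, predicate, object) tuple,
# and a named graph is simply a set of such triples.
Triple = Tuple[str, str, str]


def compute_delta(old: Set[Triple], new: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (insertions, deletions) between two versions of a graph."""
    insertions = new - old   # triples present only in the new version
    deletions = old - new    # triples present only in the old version
    return insertions, deletions


# Hypothetical toy "version store": graph name -> set of triples
version_store: Dict[str, Set[Triple]] = {
    "v1": {
        ("ex:c1", "skos:prefLabel", "Trade"),
        ("ex:c2", "skos:prefLabel", "Commerce"),
    },
    "v2": {
        ("ex:c1", "skos:prefLabel", "International trade"),
        ("ex:c3", "skos:prefLabel", "E-commerce"),
    },
}

insertions, deletions = compute_delta(version_store["v1"], version_store["v2"])

# Save both parts of the delta as named graphs of their own, next to the versions
version_store["v1-v2/insertions"] = insertions
version_store["v1-v2/deletions"] = deletions
```

Storing the insertions and deletions as separate graphs is what lets later queries combine a delta with either of the full versions.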

With that in place, we can query the version store, for example for the concepts added between two versions, like this:

# Identify concepts inserted with a certain version
#
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
PREFIX sh: <http://purl.org/skos-history/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?concept ?prefLabel
WHERE {
 # query the version history graph to get a delta and via that the relevant graphs
 # (?versionHistoryGraph binds to whichever named graph holds the version history)
 GRAPH ?versionHistoryGraph {
  ?delta a sh:SchemeDelta ;
   sh:deltaFrom/dc:identifier "8.14" ;
   sh:deltaTo/dc:identifier "9.0" ;
   sh:deltaFrom/sh:usingNamedGraph/sd:name ?oldVersionGraph ;
   dct:hasPart ?insertions .
  ?insertions a sh:SchemeDeltaInsertions ;
   sh:usingNamedGraph/sd:name ?insertionsGraph .
 }
 # for each inserted concept, a newly inserted prefLabel must exist ...
 GRAPH ?insertionsGraph {
  ?concept skos:prefLabel ?prefLabel
 }
 # ... and the concept must not exist in the old version
 FILTER NOT EXISTS {
  GRAPH ?oldVersionGraph {
   ?concept ?p []
  }
 }
}
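
The two conditions in the query can be restated over plain Python sets, which may help to see why the FILTER NOT EXISTS part is needed. This is only an illustrative sketch under the assumption that graphs are sets of (subject, predicate, object) tuples; the data and the function name `added_concepts` are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def added_concepts(insertions: Set[Triple], old_version: Set[Triple]) -> List[Tuple[str, str]]:
    """Return (concept, prefLabel) pairs for concepts that are new in the later version."""
    # Every subject with at least one triple in the old version
    # (the counterpart of "?concept ?p []" in the old version graph)
    old_subjects = {s for (s, _p, _o) in old_version}
    return sorted(
        (s, o)
        for (s, p, o) in insertions
        if p == "skos:prefLabel" and s not in old_subjects
    )


# Hypothetical toy data
old_version = {
    ("ex:c1", "skos:prefLabel", "Trade"),
    ("ex:c2", "skos:prefLabel", "Commerce"),
}
insertions = {
    ("ex:c1", "skos:prefLabel", "International trade"),  # relabeled, not new
    ("ex:c3", "skos:prefLabel", "E-commerce"),           # genuinely new
}

result = added_concepts(insertions, old_version)
print(result)  # [('ex:c3', 'E-commerce')]
```

A changed preferred label also shows up as an inserted skos:prefLabel triple, so without the check against the old version, relabeled concepts such as `ex:c1` would be reported as added.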

The resulting report, cached for better performance and availability, can be found in the change reports section of the STW site, together with reports on deprecation/replacement of concepts, changed preferred labels, hierarchy changes, and merges and splits of concepts (descriptors as well as the higher-level subject categories of STW). The queries used to create the reports are available on GitHub and linked from the report pages.

The methodology allows for aggregating changes over multiple versions and levels of the hierarchy of a concept scheme. That enabled us to gather information for the complete overhaul of STW, and to visualize it in change graphics:

[Figure: STW relaunch: Business economics]
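
As a rough illustration of such an aggregation, the following Python sketch merges the added concepts from several deltas and counts them per top-level category by following skos:broader links upward. It is a simplified stand-in, not the queries behind the STW reports; all identifiers, the toy hierarchy and the function names are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def top_ancestors(concept: str, broader: Dict[str, Set[str]]) -> Set[str]:
    """Follow broader links upward and return the top-level ancestors of a concept."""
    tops, seen, stack = set(), set(), [concept]
    while stack:
        node = stack.pop()
        if node in seen:          # guard against cycles in the hierarchy
            continue
        seen.add(node)
        parents = broader.get(node, set())
        if parents:
            stack.extend(parents)
        else:
            tops.add(node)        # no broader concept: this is a top-level node
    return tops


def additions_per_category(added: Iterable[str], broader: Dict[str, Set[str]]) -> Counter:
    """Count added concepts under each top-level category."""
    counts: Counter = Counter()
    for concept in added:
        for top in top_ancestors(concept, broader):
            counts[top] += 1
    return counts


# Hypothetical data: concepts added in two consecutive deltas
added_by_delta = {
    "v1-v2": {"ex:c3"},
    "v2-v3": {"ex:c4", "ex:c5"},
}
# Hypothetical hierarchy of the latest version: concept -> broader concepts
broader = {
    "ex:c3": {"ex:catA"},
    "ex:c4": {"ex:c3"},
    "ex:c5": {"ex:catB"},
}

# Aggregate over versions first, then over hierarchy levels
all_added = set().union(*added_by_delta.values())
counts = additions_per_category(all_added, broader)
```

Counts of this kind, computed per subject category, are the sort of figures a change graphic could be built on.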

The method applied here to STW is in no way specific to it. It relies neither on transaction logging of the internal thesaurus management system nor on any other out-of-band knowledge, but solely on the published SKOS files. Thus, it can be applied to other knowledge organization systems, by their publishers as well as by interested users of the KOS. Experiments with TheSoz, Agrovoc and the Finnish YSO have already been conducted; example endpoints with multiple versions of these vocabularies (and of STW, of course) are provided by ZBW Labs.

At the Finnish National Library as well as at the FAO, efforts are under way to explore the applicability of skos-history to the thesauri and maintenance workflows there. In the context of STW, the change reports are mostly optimized for human consumption. We hope to learn more about how people use them in automatic or semi-automatic processes: for example, to update changed preferred labels in systems working with prior versions of STW, to review indexed titles attached to split-up concepts, or to transfer changes to derived or mapped vocabularies. If you want to experiment, please fork the project on GitHub. Contributions in the issue queue as well as pull requests are highly welcome.
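
One of the automatic uses mentioned above, updating changed preferred labels in a system that still holds labels from a prior version, could look roughly like the following Python sketch. It assumes a delta given as sets of (subject, predicate, object) tuples with one preferred label per concept (language tags are ignored); the record layout and all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def changed_labels(deletions: Set[Triple], insertions: Set[Triple]) -> Dict[str, Tuple[str, str]]:
    """Map concept -> (old label, new label) for concepts whose prefLabel was replaced."""
    old = {s: o for (s, p, o) in deletions if p == "skos:prefLabel"}
    new = {s: o for (s, p, o) in insertions if p == "skos:prefLabel"}
    # Only concepts with both a deleted and an inserted label were relabeled
    return {c: (old[c], new[c]) for c in old.keys() & new.keys()}


def update_records(records: List[dict], changes: Dict[str, Tuple[str, str]]) -> int:
    """Replace outdated labels in local records in place; return the number of updates."""
    updated = 0
    for record in records:
        change = changes.get(record["concept"])
        if change and record["label"] == change[0]:
            record["label"] = change[1]
            updated += 1
    return updated


# Hypothetical delta between two versions
deletions = {
    ("ex:c1", "skos:prefLabel", "Trade"),
    ("ex:c2", "skos:prefLabel", "Commerce"),
}
insertions = {
    ("ex:c1", "skos:prefLabel", "International trade"),
    ("ex:c3", "skos:prefLabel", "E-commerce"),
}

# Hypothetical local index that was built with the prior version
records = [
    {"title": "Some indexed title", "concept": "ex:c1", "label": "Trade"},
    {"title": "Another indexed title", "concept": "ex:c2", "label": "Commerce"},
]

changes = changed_labels(deletions, insertions)
n_updated = update_records(records, changes)
```

In this toy data only `ex:c1` is relabeled; `ex:c2` has lost its label without a replacement, a case that would need the deprecation/replacement information instead.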

More detailed information can be found in a paper (Leveraging SKOS to trace the overhaul of the STW Thesaurus for Economics), which will be presented at DC-2015 in São Paulo. [Update: The presentation is available here.]