The "Integrated Authority File" (Gemeinsame Normdatei, GND) of the German National Library (DNB), the library networks of the German-speaking countries and many other institutions, is a widely recognized and used authority resource. The authority file comprises persons, institutions, locations and other entity types, in particular subject headings. With more than 134,000 concepts, organized in almost 500 subject categories, the subjects part - the former "Schlagwortnormdatei" (SWD) - is huge. That would make it a good resource for stress-testing SKOS tools, if it were available in SKOS. A seminar at the DNB on requirements for thesauri on the Semantic Web (slides, in German) provided another reason for the experiment described below.
The GND subject headings are defined using a well-thought-out set of custom classes and properties, the GND Ontology (gndo). The GND links to other vocabularies with SKOS mapping properties, which technically implies that some, but not all, of its subject headings are skos:Concepts. Many of the gndo properties already mirror SKOS or iso-thes properties. For the experiment, the relevant subset of the whole GND was selected by the gndo:SubjectHeadingSensoStricto class. A single SPARQL construct query does the selection and conversion (execute on an example concept). For skos:prefLabel and skos:altLabel, derived from gndo:preferredNameForTheSubjectHeading and gndo:variantNameForTheSubjectHeading, German language tags are added. The fine-grained hierarchical relations of gndo - generic, instantial, partitive - are dumbed down to skos:broader/narrower. All original properties of a concept are included in the output of the query.
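The mapping rules described above can be illustrated outside of SPARQL. The following Python sketch is not the construct query used in the experiment; it only mimics its logic on plain tuples. The namespace URI and the gndo:broaderTerm* property names are assumptions based on the naming pattern of the GND Ontology, and the function name to_skos is made up for this example.

```python
# Illustrative stand-in for the SPARQL construct query, working on
# (subject, predicate, object) tuples. Namespace and broaderTerm*
# property names are assumptions, not taken from the original query.
GNDO = "http://d-nb.info/standards/elementset/gnd#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SUBJECT_HEADING = GNDO + "SubjectHeadingSensoStricto"

LABEL_MAP = {
    GNDO + "preferredNameForTheSubjectHeading": SKOS + "prefLabel",
    GNDO + "variantNameForTheSubjectHeading": SKOS + "altLabel",
}
# All of these are flattened to skos:broader / skos:narrower.
BROADER_PROPS = {
    GNDO + "broaderTermGeneric",
    GNDO + "broaderTermInstantial",
    GNDO + "broaderTermPartitive",
    GNDO + "gndSubjectCategory",
}


def to_skos(triples):
    """Select subject headings and add derived SKOS triples.

    Input literals are plain strings; derived labels are emitted as
    (text, "de") pairs to represent the added German language tag.
    """
    triples = list(triples)
    # Selection step: only resources typed as subject headings.
    selected = {s for s, p, o in triples
                if p == RDF_TYPE and o == SUBJECT_HEADING}
    out = set()
    for s, p, o in triples:
        if s not in selected:
            continue
        out.add((s, p, o))  # keep every original property
        if p in LABEL_MAP:
            out.add((s, LABEL_MAP[p], (o, "de")))
        elif p in BROADER_PROPS:
            out.add((s, SKOS + "broader", o))
            out.add((o, SKOS + "narrower", s))
    for s in selected:
        out.add((s, RDF_TYPE, SKOS + "Concept"))
    return out
```

Keeping the original triples next to the derived ones means that no information is lost by the flattening: the finer-grained relation is still there for tools that understand gndo.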
Some additional work was required to integrate the GND Subject Categories (gndsc), a skos:ConceptScheme of about 484 skos:Concepts which logically form a hierarchy. (In fact, the currently published file puts all subject categories on one level.) The subject headings invariably link to one or more subject categories, but unfortunately the category data has to be downloaded and added separately (with a bit of extension). The linking property from the subject headings, gndo:gndSubjectCategory, was already dumbed down to skos:broader in the query mentioned above. Finally, we add an explicit skos:notation and some bits of metadata about the concept scheme.
This gives us a large skos:ConceptScheme, which we called swdskos and which is currently available in a SPARQL endpoint. Now we can proceed and try to show that generic SKOS tools for display, verification and version history comparisons work at that scale.
Skosmos for thesaurus display
Skosmos is an open source web application for browsing controlled vocabularies, developed by the National Library of Finland. It requires a triple store with the vocabulary loaded. (The Skosmos wiki provides detailed installation and configuration help for this.) The configuration for the GND/SWD vocabulary takes only a few lines, following the provided template. The result can be found at http://zbw.eu/beta/skosmos/swdskos:
With marginal effort, we gained a structured concept display, a very nice browsing and hierarchical view interface, and a powerful search - out of the box. The initial alphabetical display takes a few seconds, due to the large number of terms for most of the index letters. In a production setting, that could be improved by adding a Squid or Varnish cache. Navigating from concept to concept takes far less than one second, so the tool seems well suited for practical use even with larger-than-usual vocabularies. For GND, it offers an alternative to the existing access over the DNB portal, more focused on browsing contexts and with a more precise search.
Quality assurance with qSKOS
Large knowledge organization systems are prone to human mistakes, which creep in even with strict rules and careful editing. Some maintenance systems catch some of these errors, but let others slip through. So one of the really great things about SKOS as a general format for knowledge organization systems is that generic tools can be developed which catch more and more classes of errors. qSKOS has identified a number of widespread possible quality issues, on which it provides detailed analytic information. Of course, it often depends on the vocabulary which types of issues are considered errors - for example, it is expected that most GND subject headings lack a definition, so a list of 100,000+ such concepts is not helpful, whereas the list of the (in total 3) cyclic hierarchical relations is. The parametrization we use for STW seems to provide useful results here too:
java -jar qSKOS-cmd.jar analyze -np -d -c ol,chr,usr,rc,mc,ipl,dlv,urc swdskos.ttl.gz -o qskos_extended.log
The tool has already been tested with very large vocabularies (e.g., LCSH). On the swdskos dataset, it runs for 8 minutes, but it provides results which could not be obtained otherwise. For example, the list of overlapping labels (report) reveals some strange clashes (example). Standard SKOS tools thus could complement the quality assurance procedures which are already in place.
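To give an idea of what two of the checks mentioned above look for, here is a small Python sketch of overlapping labels and cyclic hierarchical relations. It is a simplified illustration of the ideas only, not qSKOS's actual implementation; the function names, the case-insensitive label comparison and the input structures are assumptions made for this example.

```python
from collections import defaultdict


def overlapping_labels(labels):
    """Group concepts that share a label, ignoring case and outer spaces.

    labels: iterable of (concept, label) pairs.
    Returns {normalized label: set of concepts} for labels used by
    two or more concepts.
    """
    by_label = defaultdict(set)
    for concept, label in labels:
        by_label[label.strip().casefold()].add(concept)
    return {lab: cs for lab, cs in by_label.items() if len(cs) > 1}


def concepts_in_cycles(broader):
    """Return all concepts that can reach themselves via 'broader'.

    broader: dict mapping a concept to the set of its broader concepts.
    Uses a plain depth-first walk per concept, which is fine for a
    sketch but not tuned for very large vocabularies.
    """
    in_cycle = set()
    for start in broader:
        stack = list(broader.get(start, ()))
        seen = set()
        while stack:
            node = stack.pop()
            if node == start:
                in_cycle.add(start)
                break
            if node in seen:
                continue
            seen.add(node)
            stack.extend(broader.get(node, ()))
    return in_cycle
```

Both functions only report candidates. Whether a shared label or a hierarchy loop is really an error still has to be decided by the people who maintain the vocabulary.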
Version comparisons with skos-history
The skos-history method makes it possible to track changes in knowledge organization systems. It was developed in the context of the STW overhaul. With swdskos, it proves to be applicable to much larger KOS. Loading the three versions and computing all version deltas takes almost half an hour (on a moderately sized virtual machine). That way, for example, we can see the 638 concepts which were deleted between the Oct 2015 and the Feb 2016 dump of GND. Some of the checked concept URIs now return concepts with different URIs but the same preferred label, so we can assume that duplicates have been removed here. The added_concepts query can be extended to make use of the - often underestimated - GND subject categories for organizing the query results, as is shown here (list filtered by the notation for computer science and data processing):
These queries only scratch the surface of what could be done by comparing multiple versions of the GND subject headings. Custom queries could try to reveal maintenance patterns, or, for example, trace the uptake of the finer-grained hierarchical properties (generic/instantial/partitive) used in GND.
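The basic idea behind such version comparisons can be sketched with plain set operations. The following Python example is a simplified illustration, not the skos-history implementation (which works on the loaded versions with SPARQL). The function names and the dictionary input, mapping concept URIs to preferred labels, are assumptions made for this example.

```python
def version_delta(old, new):
    """Compare two versions given as {concept URI: preferred label}.

    Returns three sets of URIs: added, deleted, and concepts whose
    preferred label changed between the versions.
    """
    old_uris, new_uris = set(old), set(new)
    added = new_uris - old_uris
    deleted = old_uris - new_uris
    relabelled = {u for u in old_uris & new_uris if old[u] != new[u]}
    return added, deleted, relabelled


def probable_duplicates_removed(old, new):
    """Find deleted concepts whose label lives on under another URI.

    Returns {deleted URI: set of URIs in the new version that carry
    the same preferred label}. A match is only a hint that a duplicate
    was removed, not proof.
    """
    _, deleted, _ = version_delta(old, new)
    by_label = {}
    for uri, label in new.items():
        by_label.setdefault(label, set()).add(uri)
    return {u: by_label[old[u]] for u in deleted if old[u] in by_label}
```

With real data, the two dictionaries would be filled from the preferred labels of two GND dumps, and the result could then be grouped by subject category in the same way as the extended added_concepts query.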
Generic SKOS tools seem to be useful to complement custom tools and processes for specialized knowledge organization systems. The tools considered here have shown no scalability issues with large vocabularies. The publication of an additional experimental SKOS version of the subject headings part of the GND linked data dump could perhaps stimulate further research on vocabulary development.
The code and the data of the experiment are available here.