Thesaurus-augmented Search with Jena Text

How can we get the most out of a thesaurus to support user searches? Taking advantage of SKOS thesauri published on the web, their mappings, and the latest Semantic Web tools, we can support users both with synonyms (e.g., "accountancy" for "bookkeeping") for their original search terms and with suggestions for neighboring concepts.

Thesauri describe the concepts of a domain (e.g., economics, as covered by STW), enriched with numerous alternative labels (possibly in multiple languages). These terms form a cloud of search terms which can be used on keyword and free-text fields in arbitrary databases - simply by querying for ("term a" OR "term b" OR ...).

The relationships within the thesaurus can additionally be exploited to bring up concepts (and their respective clouds of search terms) which are related to the user's original query, and which may or may not prove instrumental for further exploration of the query space.

Basic search algorithm using text retrieval

Unfortunately, users rarely use the "correct" thesaurus terms or straight synonyms in their search queries. Especially if the query consists of more than one word, these words may refer to a single concept (such as "free trade zone") or to multiple concepts (such as "financial crisis in china"), or may even contain proper names (e.g., "siebert environmental economics"). And of course, within the single search box, the terms referring to different concepts are not separated neatly, but occur without delimiters in any order.

So the first task is to identify the concepts which may be relevant to a given search query. We see no reasonable chance for an exact real-time analysis of possibly complex search queries that would split the query string and identify the single concepts (or persons, or things) according to the user's intention. Instead, we prefer to look at thesaurus concepts as text documents, with all their preferred, alternate, or even hidden labels (the latter sometimes used to cover common misspellings) as part of the document's text. We can then build a text index on these documents. This done, we can take the unmodified query string and apply standard text search algorithms to get the best-fitting documents for this query. What "best fitting" means exactly depends on the scoring algorithms of the retrieval software. Here, heuristics come into play. These heuristics can often be tuned according to the particular use case and document set. Using the defaults of the software described below, we felt no urge to tune, since it worked pretty well out of the box.
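To illustrate, a single concept together with all its labels forms one such "document". A minimal sketch, expressed as a SPARQL INSERT DATA statement with a hypothetical concept URI and invented labels:

 PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

 INSERT DATA {
   # hypothetical concept URI and labels, for illustration only;
   # all labels below would end up in one indexed text document
   <http://example.org/descriptor/telework>
     skos:prefLabel   "Telework"@en ;
     skos:altLabel    "Telecommuting"@en , "Home office"@en ;
     skos:hiddenLabel "teleworking"@en .
 }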

Once the matching concepts are identified, the thesaurus relationships can be used to offer neighboring concepts. We found the skos:narrower and skos:related relationships most useful, but this may depend on the domain and on the construction of the particular thesaurus. In a final step, all concepts can be enriched with their labels, so that they are available for building search queries.

From a practical point of view, it is important that the three steps described here are executed in a single query. Since the query is executed over the network, more than one round-trip could impact performance considerably.

Implementation as SPARQL query

We used the upcoming Jena Text module as part of the Jena Fuseki RDF/SPARQL server to implement the algorithm described above. On top of a triple store, it feeds the /combined1 web service for economics (1). The text module will be generally available with the next Jena 2.10.2 / Fuseki 0.2.8 version, but is quite stable already (snapshot here).

Jena Text allows us to ask for concepts related to, e.g., "telework" with the statement

  ?concept text:query ('telework' 1)

Lucene query syntax can be used to further specify the query. The second, numeric argument in the argument list limits the number of results. We assume that the number of different concepts addressed by a query cannot exceed the number of query words, so in general we set the limit to the query's word count.
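For the multi-word example from above, this heuristic would look as follows (a sketch; the limit simply equals the number of query words):

  # four query words, so allow for up to four matching concepts
  ?concept text:query ('financial crisis in china' 4)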

When we want to include narrower and related concepts (and their relationship to the matched concepts) in the result, we have to extend the query:

 {
   { ?concept text:query ('telework' 1) }
  UNION
   { ?concept text:query ('telework' 1) .
    ?concept skos:narrower ?narrower
   }
  UNION
   { ?tmpConcept text:query ('telework' 1) .
    ?tmpConcept skos:narrower ?concept
   }
  UNION
   { ?concept text:query ('telework' 1) .
    ?concept skos:related ?related
   }
  UNION
   { ?tmpConcept text:query ('telework' 1) .
    ?tmpConcept skos:related ?concept
   }
 }

The repetition of the text query is ugly (perhaps someone can suggest a more elegant solution), but it seems to do no real harm in terms of execution time.
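One conceivable alternative (an untested sketch, which also gives up the explicit ?narrower / ?related bindings) would be to run the text query only once in a subselect, and to collect the hit together with its neighbors via a zero-or-one property path:

 {
   { SELECT ?hit { ?hit text:query ('telework' 1) } }
   # ?concept is the matched concept itself,
   # or one of its narrower/related neighbors
   ?hit (skos:narrower|skos:related)? ?concept .
 }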

The enrichment with the labels is done by a final statement:

 {
   { ?concept skos:prefLabel ?prefLabel }
  UNION
   { ?concept skos:prefLabel ?label }
  UNION
   { ?concept skos:altLabel ?label }
  UNION
   { ?concept skos:hiddenLabel ?label }
 }
 BIND (lcase(?label) AS ?hiddenLabel)

This returns the SKOS preferred labels as such, and additionally a superset of all SKOS labels in lowercase form (as ?hiddenLabel), which comes in handy for building search queries.
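If desired, the label cloud could even be folded into a ready-made OR-query string on the server side. A self-contained sketch (not part of our production query) using GROUP_CONCAT:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?concept
       (GROUP_CONCAT(DISTINCT CONCAT('"', lcase(?label), '"'); separator=' OR ') AS ?searchString)
WHERE {
   { ?concept skos:prefLabel ?label }
  UNION
   { ?concept skos:altLabel ?label }
  UNION
   { ?concept skos:hiddenLabel ?label }
}
GROUP BY ?concept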

Making use of thesaurus mappings

More and more mappings between thesauri (and sometimes other datasets, such as DBpedia) are published. An example is a concept from STW mapped to a concept from the Thesaurus for the Social Sciences (visualized with Visual RDF).

These mappings can be exploited twofold:

  1. The relevant concepts for a query can be looked up using synonyms (e.g., "Öko-Auditing") defined for the mapped concepts.
  2. The list of synonyms returned for the identified concepts can be extended by all synonyms of the mapped concepts.

In order to simplify both of these steps, we set up some data preprocessing, eliminating structural particularities of the mapped vocabularies before or while loading them into our triple store. Reducing SKOS-XL property chains to just the simple SKOS labeling properties (pref/alt/hiddenLabel) or replacing DBpedia's rdfs:label by skos:prefLabel allows us to use the SKOS labeling properties consistently in our queries.
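The SKOS-XL reduction could, for instance, be expressed as SPARQL Update statements (a sketch; our actual preprocessing may differ in detail):

PREFIX skos:   <http://www.w3.org/2004/02/skos/core#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>

# materialize plain SKOS labels from SKOS-XL label resources
INSERT { ?concept skos:prefLabel ?literal }
WHERE  { ?concept skosxl:prefLabel/skosxl:literalForm ?literal } ;
INSERT { ?concept skos:altLabel ?literal }
WHERE  { ?concept skosxl:altLabel/skosxl:literalForm ?literal } ;
INSERT { ?concept skos:hiddenLabel ?literal }
WHERE  { ?concept skosxl:hiddenLabel/skosxl:literalForm ?literal }

The complete, executable version of the search query looks like this: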

PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX skos:  <http://www.w3.org/2004/02/skos/core#>
PREFIX zbwext: <http://zbw.eu/namespaces/zbw-extensions/>
PREFIX text:  <http://jena.apache.org/text#>

SELECT DISTINCT ?concept ?prefLabel ?hiddenLabel ?narrower ?related
WHERE {
 {
   { ?concept text:query ('telework' 2) }
  UNION
   { ?tmpConcept text:query ('telework' 2) .
    ?tmpConcept skos:exactMatch ?concept
   }
  UNION
   { ?concept text:query ('telework' 2) .
    ?concept skos:narrower ?narrower
   }
  UNION
   { ?tmpConcept text:query ('telework' 2) .
    ?tmpConcept skos:narrower ?concept
   }
  UNION
   { ?concept text:query ('telework' 2) .
    ?concept skos:related ?related
   }
  UNION
   { ?tmpConcept text:query ('telework' 2) .
    ?tmpConcept skos:related ?concept
   }
 }
 ?concept rdf:type zbwext:Descriptor .
 {
   { ?concept skos:prefLabel ?prefLabel }
  UNION
   { ?concept skos:prefLabel ?label }
  UNION
   { ?concept skos:altLabel ?label }
  UNION
   { ?concept skos:hiddenLabel ?label }
  UNION
   { ?concept skos:exactMatch/skos:prefLabel ?label }
  UNION
   { ?concept skos:exactMatch/skos:altLabel ?label }
  UNION
   { ?concept skos:exactMatch/skos:hiddenLabel ?label }
 }
 BIND (lcase(?label) AS ?hiddenLabel)
}

The BIND statement returns all synonyms in lower case, which in combination with SELECT DISTINCT filters out semi-duplicate labels that differ only in capitalization. The query makes use of the elegant new SPARQL 1.1 feature of property paths (skos:exactMatch/skos:altLabel). It currently uses solely the skos:exactMatch mapping property, which is the only one declared as transitive, so chains of such mappings should be valid too. For the narrower and related relationships, only one thesaurus (STW) is used. As you may have spotted, the maximum number of returned concepts was increased in this query, because with multiple thesauri involved, more than one concept can exactly fit the query. (The restriction to STW descriptors is applied only afterwards, by the rdf:type zbwext:Descriptor pattern.)
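The property path is merely a shorthand; written out with an intermediate variable, the altLabel case would read:

 ?concept skos:exactMatch ?mappedConcept .
 ?mappedConcept skos:altLabel ?label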

In this setting, STW serves as a backbone which is enriched by synonyms from other thesauri. We chose this design because introducing multiple, partly overlapping concepts and hierarchies from other thesauri (focused on subject areas such as social sciences or agriculture) might leave users profoundly confused. On the other hand, we would not expect much additional value for the field of economics, because it is covered quite well by STW.

Setting up the text index

To set up a text index with Fuseki, a service and a dataset have to be defined in the config.ttl file:

<#service_xyz> rdf:type fuseki:Service ;
  rdfs:label           "xyz TDB Service (R)" ;
  fuseki:name           "xyz" ;
  fuseki:serviceQuery       "query" ;
  fuseki:serviceQuery       "sparql" ;
  fuseki:serviceReadGraphStore  "data" ;
  fuseki:serviceReadGraphStore  "get" ;
  fuseki:dataset      :xyz ;
  .
:xyz rdf:type   text:TextDataset ;
  text:dataset <#xyzDb> ;
  text:index  <#xyzIndex> ;
  .
<#xyzDb> rdf:type   tdb:DatasetTDB ;
  tdb:location "/path/to/tdb/files" ;
  tdb:unionDefaultGraph true ; # Optional
  .
<#xyzIndex> a text:TextIndexLucene ;
  text:directory <file:/path/to/lucene/files> ;
  text:entityMap <#entMap> ;
  .
<#entMap> a text:EntityMap ;
  text:entityField   "uri" ;
  text:defaultField  "text" ; # Must be defined in the text:map
  text:map (
     [ text:field "text" ; text:predicate skos:prefLabel ]
     [ text:field "text" ; text:predicate skos:altLabel ]
     [ text:field "text" ; text:predicate skos:hiddenLabel ]
     ) .

Basically, this defines the locations for the database (Jena TDB) and the index files. The <#entMap> resource contains the text indexing logic: the properties to be indexed are listed and mapped to index fields. In our case, we just use the single default field ("text") for all properties.
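If, in contrast, each labeling property were mapped to a field of its own, the first list element of text:query could name the property to search (a hypothetical variation of our setup):

  # restrict the text search to preferred labels only
  ?concept text:query (skos:prefLabel 'telework' 1)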

Maintaining database and index

Fuseki supports SPARQL Update, so the simplest way to manage database and index is to use the HTTP interface to load the data. This creates or updates the database and index files.
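Such a load, sent to a writable service's update endpoint, could look like this (with hypothetical source and graph URIs):

 LOAD <http://example.org/thesaurus.rdf> INTO GRAPH <http://example.org/graph/thesaurus>

For our use case, however, we preferred to define the datastore as read-only (as in the config given above), and to rely on command line calls like this: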

 java -cp $FUSEKI_HOME/fuseki-server.jar tdb.tdbloader --tdb=config_file data_file
 java -cp $FUSEKI_HOME/fuseki-server.jar jena.textindexer --desc=config_file

The documentation points out one caveat: index entries are added, but not deleted, when the RDF data is modified. This can be worked around, or, in case of larger changes, the index can simply be rebuilt.

Just as a side note: As an alternative to building Lucene indexes as described here, Fuseki supports the use of externally defined and maintained Solr indexes. This should allow for all kinds of neat tricks, but is currently sparsely documented.

Practical use

The web service based on Fuseki and Jena Text as described above is available at zbw.eu/beta/econ-ws, supported by documentation and executable examples. The integration into an application can be done server-side, in the application's backend. Or it can be accomplished client-side, by pulling the synonyms together into a search string (for Google, this could be "term a" OR "term b" OR ...), putting it into the search box, and executing the search, purely through JavaScript on the page.

Have a look at ZBW's EconStor digital repository of papers in economics, where the service is in production use. Just try this URL. You will notice a short delay, after which search suggestions from the STW thesaurus are displayed. And when you click on one of them, a search is executed with all the synonyms filled into the search box. We do not yet have enough data for a thorough analysis of response times, but we very seldom see more than half a second - which is fast enough for the current application - and the vast majority of responses take less than 100 ms (on commonplace hardware).

For sure, the algorithms described here and their implementation as SPARQL queries may be subject to critical discussion or tuning efforts. I'd be happy to receive your feedback. A SPARQL endpoint with the STW dataset and the corresponding text index is publicly available at http://zbw.eu/beta/sparql.

Footnote

1) The service has existed since 2009. Until recently, it was based on the SPARQLite implementation by Alistair Miles, Graham Klyne and others. SPARQLite was Jena-based too, designed for scalability, quality of service and reliability (even in case of DoS attacks), and had the then-recent LARQ (Lucene + ARQ) module integrated. It thus allowed for an early implementation of the algorithms described here - albeit with query code much uglier than now with Jena Text. The development of SPARQLite was discontinued, however, in 2010.