How-to: Matching multilingual thesaurus concepts with OpenRefine

Currently, the STW Thesaurus for Economics is mapped to Wikidata, one sub-thesaurus at a time. For the next part, "B Business Economics", we have improved our prior OpenRefine matching process. Though the use case - matching concepts in a multilingual thesaurus with lots of synonyms - shouldn't be uncommon, we couldn't find guidelines on the web and provide a description of what works for us here.

OpenRefine has the non-obvious capability to run multiple reconciliations on different columns, and to combine selected matched items from these columns. It is possible to use different endpoints for the reconciliation, in our case for English and for German Wikidata labels.

The input was prepared as a tab-separated table, with columns for the concept id, German labels, English labels, the GND, to which a concept was linked (if any), and auxiliary information for the context of a concept, plus links to the concepts for convenience. The table was produced by a SPARQL query on the STW endpoint. This query created a multi-valued labels column for each language, with the preferred label at the start and the alternate labels following - "Corporate objective | Business goal | Business objective | Corporate objective function | Goal setting of firms", for example.

The table was loaded into OpenRefine, and, as the very first operations, the labels fields were split up at the delimiter (Edit Cells/Split multi-valued cells). Now, the reconciling runs for each single label, as mentioned above, against the appropiate endpoint. In the case of STW, we had to "Reconcile against no particular type", and had to de-activate one default setting, "Auto-match candidates with high confidence". We didn't want to create matches by pure string similarity, but wanted to select the matching item manually. Often, multiple matches are offered for each label. Many of them are the titles of journals or scholarly articles, which OpenRefine unfortunately lets us not exclude from the matching process. However, with all synonyms looked up, we can be rather sure that the most fitting item is among the displayed matches, and it is normally ranking high in one of the lists.

Matching a STW descriptor with Wikidata

As an additional matching vector, we use the GND id. Almost all STW descriptors, and many items of Wikidata, have these ids. Therefore, we added a gndId column to the table, and also another byGnd column (with non-sensical values like "gnd4568727-4"). On the latter column, we invoked reconsiliation, set the same options as above, and additionally checked the gndId column in "Also use relevant details from other columns:", selecting Wikidata's GND-ID (P227) as property. With the reconsiliation against an arbitrary-language endpoint done, the items with matching GND are shown, while no other matches should occur.

This series of operations leaves us with three reconsiled columns and multiple suggested matches. Walkig over the results, only one match per record should be selected - as this is sufficient, and only one can be used anyway. With all records processed, a consolidation is necessarry, as the best-fitting match can be located in any of the three columns and in any of the rows of the record.

As the first consolidation step, we create a new qid column (with "Edit column/Add column based on this column" on an arbitrary existing column). We reference in the "Expression" field the three reconsiled columns and, at the same time, extract from them the Wikidata identifiers, i.e. QID values:

coalesce(cells['?byGnd'], cells['?byLabelsDe'], cells['?byLabelsEn']

OpenRefine's coalesce function takes the first non-null value. As the second consolidation step, we execute "Edit cells/Join multi-valued cells" on the qid column - the reverse operation of "Split multi-valued cells" on the labels fields at the beginning. We now have got a table with one row per concept, and can easily extract an STWID/QID table with all found matches.

Thus, OpenRefine allows us to automate the tedious process of looking up multiple labels in different languages, and let the human user just check the suggestions and select the correct match.