Wikidata as authority linking hub: Connecting RePEc and GND researcher identifiers

2017-11-30 by Joachim Neubert

In the EconBiz portal for publications in economics, we have data from different sources. In some of these sources, most notably ZBW's "ECONIS" bibliographical database, authors are disambiguated by identifiers of the Integrated Authority File (GND) - in total more than 470,000. Data stemming from "Research Papers in Economics" (RePEc) contains another identifier: RePEc authors can register themselves in the RePEc Author Service (RAS), and claim their papers. This data is used for various rankings of authors and, indirectly, of institutions in economics, which provides a big incentive for authors - about 50,000 have signed up with RAS - to keep both their article claims and personal data up-to-date. While GND is well known and linked to many other authorities, RAS had no links to any other researcher identifier system. Thus, until recently, the two sets of author identifiers were disconnected, which made it impossible to display all publications of an author on a single portal page.

To overcome that limitation, colleagues at ZBW have matched a good 3,000 authors with RAS and GND IDs by their publications (see details here). Making that pre-existing mapping maintainable and extensible, however, would have meant setting up a custom editing interface and would have required storage and operating resources, and the mapping could not easily have been made publicly accessible. In a previous article, we described the opportunities offered by Wikidata. Now we have put it to use.

Initial situation in Wikidata

Economists were, at the start of this small project in April 2017, already well represented among the 3.4 million persons in Wikidata - though the precise extent is difficult to estimate. Furthermore, properties for linking GND and RePEc author identifiers to Wikidata items were already in place:

P227 “GND ID”, in ~375,000 items
P2428 “RePEc Short-ID” (further-on: RAS ID), in ~2,200 items
both properties in ~760 items

For both properties, “single value” and “distinct values” constraints are defined, so that (with rare exceptions) a 1:1 relation between the authority entry and the Wikidata item should exist. That, in turn, means that a 1:1 relation between both authority entries can be assumed.
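The reasoning above can be sketched in a few lines of Python. This is only an illustration, not the checks Wikidata itself runs: the function name `derive_pairs` and the sample identifiers are made up, and the input is assumed to be a plain list of (item, property, value) triples for P227 and P2428.

```python
from collections import defaultdict

def derive_pairs(statements):
    """Derive GND -> RAS pairs from (item, property, value) triples.

    Only items with exactly one P227 and one P2428 value, neither of
    which is used on another item, contribute a pair.
    """
    by_item = defaultdict(lambda: defaultdict(set))
    by_value = defaultdict(set)
    for item, prop, value in statements:
        by_item[item][prop].add(value)
        by_value[(prop, value)].add(item)

    violations = []
    # "single value": an item should carry at most one value per property
    for item, props in by_item.items():
        for prop, values in props.items():
            if len(values) > 1:
                violations.append(("single value", prop, item))
    # "distinct values": a value should occur on at most one item
    for (prop, value), items in by_value.items():
        if len(items) > 1:
            violations.append(("distinct values", prop, value))

    pairs = {}
    for item, props in by_item.items():
        gnd = props.get("P227", set())
        ras = props.get("P2428", set())
        if len(gnd) == 1 and len(ras) == 1:
            g, r = next(iter(gnd)), next(iter(ras))
            if len(by_value[("P227", g)]) == 1 and len(by_value[("P2428", r)]) == 1:
                pairs[g] = r
    return pairs, violations
```

Items that violate one of the constraints are reported rather than silently paired, which mirrors the "rare exceptions" mentioned above.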

The relative numbers of IDs in EconBiz and Wikidata are illustrated by the following image.

Venn diagram at project start

Person identifiers in Wikidata and EconBiz, with unknown overlap at the beginning of the project (the number of 1.1 million persons in EconBiz is a very rough estimate, because most names – outside GND and RAS – are not disambiguated)

Since many economists have Wikipedia pages, from which Wikidata items have been created routinely, the first task was finding these items and adding GND and/or RAS identifiers to them. The second task was adding items for persons who did not yet exist in Wikidata.

Adding mapping-derived identifiers to Wikidata items

For items already identified by either GND or RAS, the reciprocal identifiers were added automatically: A federated SPARQL query on the mapping and the public Wikidata endpoint retrieved the items and the missing IDs. A script transformed that into input for Wikidata’s QuickStatements2 tool, which allows adding statements (as well as new items) to Wikidata. The tool takes CSV-formatted input via a web form and applies it in batch to the live dataset.

Quickstatements2 insert

Import statements for QuickStatements2. The first input line adds the RAS ID “pan31” to the item for the economist James Andreoni. The rest of the input line creates a reference to ZBW's mapping for this statement and so allows tracking its provenance in Wikidata.
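The transformation step might look roughly like the following Python sketch. It is not the script actually used. It assumes the tab-separated QuickStatements command syntax (item, property, quoted string value, then `S248` for a "stated in" reference), and the function names, item IDs, identifier values and the source item `Q999` are placeholders.

```python
def statement(qid, prop, value, source_qid):
    """One QuickStatements line: statement plus a 'stated in' (S248) reference."""
    return "\t".join([qid, prop, f'"{value}"', "S248", source_qid])

def missing_id_commands(mapping, items, source_qid):
    """Emit commands for items that carry only one of the two identifiers.

    mapping: list of (gnd_id, ras_id) pairs from the pre-existing mapping
    items:   dict of item ID -> {"P227": gnd_id, "P2428": ras_id} (keys optional)
    """
    gnd_to_ras = dict(mapping)
    ras_to_gnd = {ras: gnd for gnd, ras in mapping}
    commands = []
    for qid, ids in sorted(items.items()):
        gnd, ras = ids.get("P227"), ids.get("P2428")
        if gnd and not ras and gnd in gnd_to_ras:
            # item known by GND ID: add the RAS ID from the mapping
            commands.append(statement(qid, "P2428", gnd_to_ras[gnd], source_qid))
        elif ras and not gnd and ras in ras_to_gnd:
            # item known by RAS ID: add the GND ID from the mapping
            commands.append(statement(qid, "P227", ras_to_gnd[ras], source_qid))
    return commands
```

Items that already carry both identifiers, or whose identifier does not occur in the mapping, produce no command, so the output can be pasted into the tool repeatedly without creating redundant statements.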

That step resulted in 384 GND IDs added to items identified by RAS ID, and, in the reverse direction, 77 RAS IDs added to items identified by GND ID. For the future, it is expected that tools like wdmapper will facilitate such operations.

Identifying more Wikidata items

Obviously, the previous step left out the economists who already existed in Wikidata but up to then had neither a GND nor a RAS ID. Therefore, these items had to be identified by adding one of the identifiers. A semi-automatic approach was applied to that end, starting with the “most important” persons from the RePEc and EconBiz datasets. That was extended in an automatic step, taking advantage of existing VIAF identifiers (a step which could also have been the first one).

For RePEc, the “Top economists” ranking page (~4,600 authors) was scraped and cross-linked to a custom-created basic RDF dataset of the RePEc authors. The result was transformed to an input file for Wikidata’s Mix’n’match tool, which had been developed for the alignment of external catalogs with Wikidata. The tool takes a simple CSV file, consisting of a name, a description and an identifier, and tries to automatically match against Wikidata labels. In a subsequent interactive step, every match can be confirmed or removed. If confirmed, the identifier is automatically added as a value of the corresponding property of the matched Wikidata item.
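A minimal sketch of how such an input file could be produced is shown below. The column order (identifier, name, description), the tab delimiter and the field names of the author records are assumptions made for illustration; the tool's import page should be consulted for the format it actually expects.

```python
import csv
import io

def write_catalog(authors, delimiter="\t"):
    """Serialize author records as one line per entry: id, name, description."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    for author in authors:
        # the description helps human reviewers tell namesakes apart
        description = "; ".join(author.get("affiliations", []))
        writer.writerow([author["id"], author["name"], description])
    return buf.getvalue()
```

For example, `write_catalog([{"id": "pdo1", "name": "Jane Doe", "affiliations": ["Example University"]}])` yields a single tab-separated line for a fictitious author.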

For GND, all authors with more than 30 publications in EconBiz were selected in a custom SPARQL endpoint. Just like the “RePEc Top” matchset, a “GND economists (de)” matchset with ~18,000 GND IDs, names and descriptions was loaded into Mix’n’match and aligned to Wikidata.

As we became more familiar with the Wikidata-related tools, policies and procedures, we exploited existing VIAF property values as another opportunity for seeding GND IDs in Wikidata. In a federated SPARQL query on a custom VIAF endpoint and the public Wikidata endpoint, about 12,000 missing GND IDs were determined and added to Wikidata items which had been identified by VIAF ID.
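The logic of that join can be illustrated without SPARQL. The following Python sketch assumes two already-retrieved inputs with hypothetical shapes: a list of Wikidata items with their VIAF ID (P214) and, if present, GND ID (P227), and a lookup from VIAF IDs to the GND IDs in the same VIAF cluster. It is not the query actually used.

```python
def missing_gnd_via_viaf(wikidata_items, viaf_to_gnd):
    """Return (item ID, GND ID) pairs for items that have a VIAF ID but no GND ID.

    wikidata_items: iterable of dicts with keys "qid", "viaf" and optionally "gnd"
    viaf_to_gnd:    dict of VIAF ID -> list of GND IDs in that VIAF cluster
    """
    additions = []
    for item in wikidata_items:
        if item.get("gnd"):
            continue  # already identified by GND, nothing to add
        gnd_ids = viaf_to_gnd.get(item["viaf"], [])
        # only unambiguous clusters are used, to respect the 1:1 assumption
        if len(gnd_ids) == 1:
            additions.append((item["qid"], gnd_ids[0]))
    return additions
```

Skipping clusters with several GND IDs is a cautious design choice of this sketch; such cases would need manual review.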

After each of these steps, the first task (adding mapping-derived GND or RAS identifiers) was repeated. That resulted in 1908 Wikidata items carrying both IDs. Since ZBW's author mapping was based on at least 10 matching publications, the alignment of high-frequency GND authors and highly ranked RePEc authors made it highly probable that authors already present in Wikidata had been identified in the previous steps. That reduced the danger of creating duplicates in the following task.

Creating new Wikidata items from the mapped authorities

For the rest of the authors in the mapping, 2179 new Wikidata items were created. This task was carried out again by the QuickStatements2 tool, for which the input statements were created by a script, based on a SPARQL query on the afore-mentioned endpoints for RePEc authors and GND entries. The input statements were derived from both authorities, in the following fashion:

the label (name of the person) was taken from GND
the occupation “economist” was derived from RePEc (and in particular from the occurrence in its “Top Economists” list)
gender and date of birth/death were taken from GND (if available)
the English description was a concatenated string “economist” plus the affiliations from RePEc
the German description was a concatenated string “Wirtschaftswissenschaftler/in” plus the affiliations from GND

The use of Wikidata’s description field for affiliations was a makeshift solution: In the absence of an existing mapping of RePEc (and mostly also GND) organizations to Wikidata, it allows for better identification of the individual researchers. In a later step, when corresponding organization/institute items exist in Wikidata and mappings are in place, the items for authors can be supplemented step-by-step by formal “affiliation” (P1416) statements.

In accordance with Wikidata’s policy, an extensive reference to the source was added to each statement in the synthesized new Wikidata items.
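The derivation of input statements described above could be sketched as follows. This is an illustration under several assumptions, not the script used in the project: it uses the QuickStatements `CREATE`/`LAST` command syntax, only generates the English description, formats the description as "economist (affiliations)", and takes the record field names and the two source items (`gnd_source`, `repec_source`) as hypothetical inputs. The constants for "human", "economist" and the gender values are Wikidata item IDs assumed to be correct.

```python
HUMAN = "Q5"
ECONOMIST = "Q188094"  # assumed item for the occupation "economist"
GENDER = {"female": "Q6581072", "male": "Q6581097"}

def quoted(text):
    """Wrap a string value in double quotes, as QuickStatements expects."""
    return '"' + text.replace('"', "'") + '"'

def create_item_commands(author, gnd_source, repec_source):
    """Build QuickStatements commands for one new person item."""
    def stmt(prop, value, source):
        # every statement carries a "stated in" (S248) reference
        return "\t".join(["LAST", prop, value, "S248", source])

    affiliations = "; ".join(author.get("repec_affiliations", []))
    description = "economist" + (f" ({affiliations})" if affiliations else "")
    lines = [
        "CREATE",
        "\t".join(["LAST", "Len", quoted(author["name"])]),   # label from GND
        "\t".join(["LAST", "Den", quoted(description)]),      # description from RePEc
        stmt("P31", HUMAN, gnd_source),
        stmt("P106", ECONOMIST, repec_source),
        stmt("P227", quoted(author["gnd"]), gnd_source),
        stmt("P2428", quoted(author["ras"]), repec_source),
    ]
    if author.get("gender") in GENDER:
        lines.append(stmt("P21", GENDER[author["gender"]], gnd_source))
    if author.get("born"):  # ISO date such as "1950-03-14", day precision (/11)
        lines.append(stmt("P569", f'+{author["born"]}T00:00:00Z/11', gnd_source))
    return lines
```

Joining the returned lines with newlines gives one batch entry per author; optional facts (gender, date of birth) are simply left out when the authority record lacks them.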

The creation of items in an automated fashion involves the danger of duplicates. However, such duplicates turned up only in very few cases. They have been resolved by merging items, which technically is very easy in Wikidata. Interestingly, a number of “fake duplicates” in fact revealed various quality issues, in Wikidata and in both of the authority files, which have subsequently been resolved as well.

... and even more new items for economists ...

The good experiences so far made us bolder, and we considered creating Wikidata items for the still missing "Top Economists" (according to RePEc).

For item creation, one aspect we had to consider was compliance with Wikidata's notability policy. This policy is much more relaxed than the policies of the large Wikipedias. It states as one criterion sufficient for item creation that the item "refers to an instance of a clearly identifiable conceptual or material entity. The entity must be notable, in the sense that it can be described using serious and publicly available references." There seems to be some consensus in the community that authority files such as GND or RePEc authors count as "serious and publicly available references". This of course should hold even more for a bibliometrically ranked subset of these external identifiers.

We thus inserted another 1,839 Wikidata items for the rest of the RePEc Top 10 % list. Additionally - to mitigate the inherent gender bias such selections often carry - we imported all missing researchers from RePEc's "Top 10 % Female Economists" list. Again, we added reference statements to RePEc which allow Wikidata users to keep track of the source of the information.

Results

The immediate result of the project was:

all of the 3081 pairs of identifiers from the initial mapping by ZBW are now incorporated in Wikidata items
1217 Wikidata items in addition to these also have both identifiers (created by individual Wikidata editors, or the efforts described above)

(All numbers in this section as of 2017-11-13.) While that still is only a beginning, given the total number of authors represented in EconBiz, it is a significant share of the “most important” ones:

Venn top economists

Top 10 % RAS and frequent GND in EconBiz (> 30 publications). “Wikidata economists” is a rough estimate of the number of persons in the field of economics (twice the number of those with the explicit occupation “economist”)

While the top RePEc economists are now completely covered by Wikidata, for GND the overlap has been improved significantly during the last year. This occurred in part as a side-effect of the efforts described above, and in part it is caused by the genuine growth of Wikidata in regard to the number of items as well as the increasing density of external identifiers.

Here are the current percentages, compared to those one year earlier, which were presented in our previous article:

Economists in Wikidata

Large improvements in the coverage of the most frequent authors by Wikidata (query, result)

While the improvements in absolute numbers are impressive, too - the number of GND IDs for all EconBiz persons (with at least one publication) has increased from 39,778 to 59,074 - the image demonstrates that the coverage of our most frequent authors in particular has risen considerably.

The addition of all RePEc top economists has created further opportunities for matching these items from the afore-mentioned GND Mix’n’match set, which will again add to the mapping. Once all matching and duplicate checking is done, we may reconsider the option of adding the remaining frequent GND persons (>30 publications in EconBiz) automatically to Wikidata.

The mapping data can be retrieved by everyone, via SPARQL queries, by specialized tools such as wdmapper, or as part of the Wikidata dumps. What is more, it can be extended by everybody, either as a by-product of individual edits adding identifiers to persons in Wikidata, or by a directed approach. For directed extensions, any subset can be used as a starting point: either a new version of the above-mentioned ranking, or other rankings also published by RePEc, covering in particular female economists or economists from, e.g., Latin America; or all identifiers from a particular institution, derived from either GND or RAS. The results of all such efforts are available at once and add up continuously.
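As a rough illustration of the SPARQL route, the sketch below builds a request URL for the public Wikidata query endpoint and parses a response in the standard SPARQL JSON results format. The query selects all items carrying both P227 and P2428. The function names are made up, and the actual HTTP request (with a suitable user agent and error handling) is deliberately left out.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"

# all items that carry both a GND ID (P227) and a RePEc Short-ID (P2428)
QUERY = """
SELECT ?item ?gnd ?ras WHERE {
  ?item wdt:P227 ?gnd ;
        wdt:P2428 ?ras .
}
"""

def request_url(query=QUERY, endpoint=ENDPOINT):
    """URL for a GET request that asks the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})

def parse_mapping(response_text):
    """Turn a SPARQL JSON response into (item ID, GND ID, RAS ID) tuples."""
    data = json.loads(response_text)
    rows = []
    for binding in data["results"]["bindings"]:
        # item values are entity URIs; keep only the trailing Q-number
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        rows.append((qid, binding["gnd"]["value"], binding["ras"]["value"]))
    return rows
```

The response text could be fetched, for instance, with `urllib.request.urlopen(request_url())`; separating URL construction and parsing from the network call keeps both parts easy to test offline.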

Yet, the benefits of using Wikidata cannot be reduced to the publication and maintenance of the mapping itself. In many cases it offers much more than just a linking point for two identifiers:

links to Wikipedia pages about the authors, possibly in multiple languages
rich data about the authors in defined formats, sometimes with explicit provenance information
access to pictures etc. from Wikimedia Commons, or quotations from Wikiquote
links to multiple other authorities

As an example of the latter, the 6825 RAS identifiers now in Wikidata are already mapped to 2389 VIAF and 1742 LoC authority IDs (while the number of ORCID IDs, at 69, is still remarkably low). At the same time, these RePEc-connected items were linked to 1502 English, 690 German and 272 Spanish Wikipedia pages which provide rich human-readable information.

In turn, when we take the GND persons in EconBiz as a starting point, roughly 60,000 are already represented in Wikidata. Besides large numbers of other identifiers, the corresponding Wikidata items offer more than 33,000 links to German and more than 24,000 links to English Wikipedia pages (query).

For ZBW, “releasing” the dataset into Wikidata as a trustworthy and sustainable public database not only saves the “technical” costs of data ownership (programming, storage, operating, for access and for maintenance). It also means that responsibility for, and fun from, extending, amending and keeping the dataset current can be shared with many other interested parties and individuals.
