Wikidata is a large database that connects all of the roughly 300 Wikipedia projects. Besides interlinking all Wikipedia pages in different languages about a specific item (e.g., a person), it also connects to more than 1000 different sources of authority information.
The linking is achieved by an "authority control" class of Wikidata properties. The values of these properties are identifiers that unambiguously identify the Wikidata item in external, web-accessible databases. Each property definition includes a URI pattern (called "formatter URL"). When the identifier value is inserted into the URI pattern, the resulting URI can be used to look up the authority entry. The resulting URI may point to a Linked Data resource, as is the case with the GND ID property. On the one hand, this provides a lightweight and robust mechanism for creating links in the web of data. On the other hand, any application driven by one of the authorities can exploit these links to provide additional data: links to Wikipedia pages in multiple languages, images, life data, nationality and affiliations of the respective persons, and much more.
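The formatter URL mechanism can be sketched in a few lines of Python. The sketch assumes the `$1` placeholder convention used in Wikidata formatter URLs and the pattern `https://d-nb.info/gnd/$1` for the GND ID property; the function name is hypothetical, and the identifier value is a made-up placeholder, not a real GND ID.

```python
from urllib.parse import quote

# Assumed formatter URL of the GND ID property; "$1" marks the identifier slot.
GND_FORMATTER_URL = "https://d-nb.info/gnd/$1"


def expand_formatter_url(formatter_url: str, identifier: str) -> str:
    """Insert an identifier value into a formatter URL pattern."""
    if "$1" not in formatter_url:
        raise ValueError("formatter URL has no $1 placeholder")
    # Percent-encode the identifier so it is safe inside a URI path.
    return formatter_url.replace("$1", quote(identifier, safe=""))


# Hypothetical identifier value, for illustration only.
print(expand_formatter_url(GND_FORMATTER_URL, "123456789"))
```

Because the pattern is stored with the property rather than with each item, a single change to the formatter URL updates every link derived from that property.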
Wikidata item for the Indian economist Bina Agarwal, visualized via the SQID browser
In 2014, a group of students under the guidance of Jakob Voß published a handbook on "Normdaten in Wikidata" (in German), describing the structures and the practical editing capabilities of the standard Wikidata user interface. The experiment described here focuses on persons from the subject domain of economics. It uses the authority identifiers of the roughly 450,000 economists referenced by their GND ID as creators, contributors or subjects of books, articles and working papers in ZBW's economics search portal EconBiz. These GND IDs were obtained from a prototype of the upcoming EconBiz Research Dataset (EBDS). About 40,000 of these persons, or 8.7 %, are connected to a Wikidata item via their GND ID. If we consider only the frequent (more than 30 publications) and the very frequent (more than 150 publications) authors in EconBiz, the coverage increases significantly:
|Number of publications|Total|In Wikidata|Percentage|
|---|---|---|---|
|> 0|457,244|39,778|8.7 %|
|> 30|18,008|3,232|17.9 %|
|> 150|1,225|547|44.7 %|
Datasets: EBDS as of 2016-11-18; Wikidata as of 2016-11-07 (query, result)
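The percentages follow directly from the two count columns. As a quick check, this minimal Python sketch recomputes them from the figures in the table above; the function name is our own.

```python
# Figures copied from the table above:
# (publication threshold, total persons, persons found in Wikidata).
ROWS = [
    ("> 0", 457_244, 39_778),
    ("> 30", 18_008, 3_232),
    ("> 150", 1_225, 547),
]


def coverage_percent(total: int, linked: int) -> float:
    """Share of persons linked to Wikidata, in percent, rounded to one decimal."""
    return round(100 * linked / total, 1)


for threshold, total, linked in ROWS:
    print(f"{threshold:>6}: {coverage_percent(total, linked)} %")
```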
These are numbers "out of the box": ready-made opportunities to link out from existing metadata in EconBiz and to enrich user interfaces with biographical data from Wikidata/Wikipedia, without any additional effort to improve the coverage on either the EconBiz or the Wikidata side. However, we can safely assume that many of the EconBiz authors, particularly the high-frequency authors, and even more of the persons who are the subject of publications, are "notable" according to the Wikidata notability guidelines. Their items probably exist and are merely missing the corresponding GND property.
To check this assumption, we take a closer look at the Wikidata persons that have the occupation "economist" (most Wikidata properties accept other Wikidata items, instead of arbitrary strings, as values, which allows for exact queries and is indispensable in a multilingual environment). Of these approximately 20,000 persons, fewer than 30 % have a GND ID property! Even if we restrict the set to the 4,800 "internationally recognized economists" (which we define here as having Wikipedia pages in three or more different languages), almost half of them lack a GND ID property. By comparison, more than 50 % of all and 80 % of the internationally recognized Wikidata economists are linked to VIAF (SPARQL Lab live query). Therefore, for a large share of the persons we have looked at here, we can take it for granted that the person exists in Wikidata as well as in the GND, and that the only reason for the lack of a GND ID is that nobody has added it to Wikidata yet.
As an aside: the information about the occupation of persons should be taken as a very rough approximation. Some Wikidata persons were economists by education or at some point in their career, but are now famous for other reasons (examples include Vladimir Putin or the president of Liberia, Ellen Johnson Sirleaf). On the other hand, EconBiz authors known to Wikidata are often classified not as economist, but as university teacher, politician, historian or sociologist. Nevertheless, their work was deemed relevant for the broad field of economics, and the conclusions drawn for the "economists" in Wikidata and GND will hold for them, too: there are lots of opportunities for linking already well-defined items.
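Coverage figures of this kind can be obtained with a counting query. The following Python sketch only builds the SPARQL text; it is not the SPARQL Lab query used for this post. It assumes the Wikidata identifiers P106 (occupation), Q188094 (economist), P227 (GND ID) and P214 (VIAF ID), and the function name `coverage_query` is hypothetical.

```python
# Assumed Wikidata identifiers:
# P106 = occupation, Q188094 = economist, P227 = GND ID, P214 = VIAF ID.
PREFIXES = """\
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
"""


def coverage_query(occupation: str, authority_prop: str) -> str:
    """Build a query counting all persons with the given occupation
    and, among them, those that carry the given authority property."""
    return PREFIXES + f"""\
SELECT (COUNT(DISTINCT ?person) AS ?total)
       (COUNT(DISTINCT ?linked) AS ?withId)
WHERE {{
  ?person wdt:P106 wd:{occupation} .
  OPTIONAL {{
    ?person wdt:{authority_prop} ?id .
    BIND(?person AS ?linked)
  }}
}}
"""


# Economists with and without a GND ID.
print(coverage_query("Q188094", "P227"))
```

Calling `coverage_query("Q188094", "P214")` yields the corresponding query for VIAF. Restricting the set to persons with Wikipedia pages in three or more languages would need an additional condition that is not shown here.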
What can we gain?
The screenshot above demonstrates that Wikidata offers more than data about the person herself, her affiliations, awards received, and possibly many other details. The "Identifiers" box at the bottom right shows authority entries. Besides the GND ID, which served as an entry point for us, there are links to VIAF and other national libraries' authorities, but also to non-library identifier systems like ISNI and ORCID. In total, Wikidata comprises more than 14 million authority links, more than 5 million of them for persons.
When we take a closer look at the 40,000 EconBiz persons that we can look up by their GND ID in Wikidata, an astonishing variety of authorities is addressed from there: 343 different authorities are linked from the subset, ranging from "almost complete" (VIAF, Library of Congress Name Authority File) to authorities that are quite exotic in the given context, e.g., Members of the Belgian Senate, chess players or Swedish Olympic Committee athletes. Some of these entries link to carefully crafted biographies, sometimes behind a paywall (Notable Names Database, Oxford Dictionary of National Biography, Munzinger Archiv, Sächsische Biographie, Dizionario Biografico degli Italiani), or to free text resources (Project Gutenberg authors). Links to the world of museums and archives are also provided, from the Getty Union List of Artist Names to specific links into the British Museum or the Musée d'Orsay collections.
Properties that express the prominence of the respective persons are particularly useful: Nobel Prize IDs, for example, definitely should be linked to the corresponding GND IDs (and indeed, they are). TED speakers or persons with an entry in the Munzinger Archive (a famous and long-established German biographical service) can also be assumed to have GND IDs. That opens a road to a very focused improvement of the data quality: a list of persons with such properties, restricted to the subject field (e.g., "occupation economist"), can easily be generated from Wikidata's SPARQL Query Service. In Wikidata, it is very easy to add the missing ID entries discovered during such cross-checks interactively. And if it turns out that a "very important" person from the field is missing from the GND altogether, that is an all-the-more valuable opportunity to improve the data quality at the source.
How can we start improving?
As a proof of concept, and as a practical starting point, we have developed a micro-application for adding missing authority property values. It consists of two SPARQL Lab scripts: missing_property creates a list of Wikidata persons that have a certain authority property (by default: TED speaker ID) and lack another one (by default: GND ID). For each entry in the list, a link to an application is created, which looks up the name in the corresponding authority file (by default: search_person, for a broad yet ranked full-text search of person names in GND). If we can identify the person in the GND list, we can copy the GND ID, return to the missing_property list, click on the link to the Wikidata item of the person and add the property value manually through Wikidata's standard edit interface. (Wikidata is open and welcomes such contributions!) The change takes effect within a few seconds: when we reload the missing_property list, the improved item should no longer show up.
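The core of such a list is a query for items that have one property and lack another. The sketch below builds a comparable SPARQL query in Python. It is not the missing_property script itself, and it assumes the property IDs P2611 (TED speaker ID), P227 (GND ID) and P106 (occupation); the function name and the parameters are hypothetical.

```python
import re

PREFIXES = """\
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
"""


def missing_property_query(has_prop="P2611", lacks_prop="P227",
                           occupation=None, limit=100):
    """Build a query for persons that have has_prop but lack lacks_prop.

    Assumed defaults: P2611 = TED speaker ID, P227 = GND ID.
    """
    # Accept only well-formed IDs, since they are pasted into the query text.
    for prop in (has_prop, lacks_prop):
        if not re.fullmatch(r"P\d+", prop):
            raise ValueError(f"not a property ID: {prop!r}")
    if occupation is not None and not re.fullmatch(r"Q\d+", occupation):
        raise ValueError(f"not an item ID: {occupation!r}")

    # Optional restriction to a subject field via P106 (occupation).
    occupation_clause = (
        f"  ?person wdt:P106 wd:{occupation} .\n" if occupation else ""
    )
    return (
        PREFIXES
        + "SELECT DISTINCT ?person ?id WHERE {\n"
        + f"  ?person wdt:{has_prop} ?id .\n"
        + occupation_clause
        + f"  FILTER NOT EXISTS {{ ?person wdt:{lacks_prop} ?other }}\n"
        + "}\n"
        + f"LIMIT {int(limit)}\n"
    )


# TED speakers with occupation "economist" (assumed Q188094) lacking a GND ID.
print(missing_property_query(occupation="Q188094"))
```

`FILTER NOT EXISTS` keeps only those persons for whom no statement with the second property can be found, which is exactly the set of candidates for manual completion.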
Instead of identifying the most prominent economics-related persons in Wikidata, the other way works too: while most of the GND-identified persons are related to only one or two works, as the corresponding statistics show, a few are related to a disproportionate number of publications. Of the 1,200 persons related to more than 150 publications, fewer than 700 are missing links to Wikidata by their GND ID. By adding this property (for the vast majority of these persons, a Wikidata item should already exist), we could enrich, at a rough estimate, more than 100,000 person links in EconBiz publications. Another micro-application demonstrates how the work could be organized: the list of EconBiz persons by descending publication count provides "SEARCH in Wikidata" links (functional on a custom endpoint). Each link triggers a query that looks up all name variants in GND and executes a search for these names in a full-text indexed Wikidata set, bringing up a ranked list of suggestions (example with the GND ID of John H. Dunning). Again, the GND ID can be added, manually but in a straightforward way, to an identified Wikidata item.
While we cannot expect such manual efforts to significantly reduce the quantitative gap between the 450,000 persons in EconBiz and the 40,000 of them linked to Wikidata, we surely can improve the coverage step by step for the most prominent persons. This enables applications to show biographical background links to Wikipedia where our users are most likely to expect them. Other tools for creating authority links and more automated approaches will be covered in further blog posts. And the great thing about Wikidata is that all efforts add up: while we are making modest improvements in our field of interest, many others do the same, so Wikidata already features an impressive overall amount of authority links.
PS. All queries used in this analysis are published on GitHub. The public Wikidata endpoint cannot be used for research involving large datasets due to its limitations (in particular the 30-second timeout, the preclusion of the "service" clause for federated queries, and the lack of full-text search). Therefore, we have loaded the Wikidata dataset (along with others) into custom Apache Fuseki endpoints on a performant machine. Even there, a "power query" like the one on the number of all authority links in Wikidata takes about 7 minutes. For that reason, we publish the corresponding result files in the GitHub repository alongside the queries.