20th Century Press Archives: Data donation to Wikidata

2019-10-24 by Joachim Neubert

ZBW is donating a large open dataset from the 20th Century Press Archives to Wikidata to make it more accessible to scholarly disciplines such as contemporary, economic and business history, and media and information science, as well as to journalists, teachers, students, and the general public.

The 20th Century Press Archives (PM20) is a large public newspaper clippings archive, extracted from more than 1500 different sources published in Germany and all over the world, covering roughly a full century (1908-2005). The clippings are organized in thematic folders about persons, companies and institutions, general subjects, and wares. During a project originally funded by the German Research Foundation (DFG), the material up to 1960 has been digitized. 25,000 folders with more than two million pages up to 1949 are freely accessible online. The fine-grained thematic access and the public nature of the archives make it, to the best of our knowledge, unique across the world (more information on Wikipedia) and an essential source of research data for some of the disciplines mentioned above.

The data donation means more than ZBW assigning a CC0 license to all PM20 metadata, which makes it compatible with Wikidata. (Due to intellectual property rights, only the metadata can be licensed by ZBW - all legal rights to the press articles themselves remain with their original creators.) The donation also includes a substantial investment of working time, planned to span two years, devoted to integrating this data into Wikidata. Here we want to share our experiences with the integration of the persons archive metadata.

Folders from the person archive, 2015 (Credit: Max-Michael Wannags)

Linking our folders to Wikidata

The essential bit for linking the digitized folders was in place before the project even started: an external identifier property (PM20 folder ID, P4293), proposed by an administrator of the German Wikipedia in order to link to PM20 person and company folders. We participated in the property proposal discussion and made sure that the links did not have to reference our legacy ColdFusion application. Instead, we created a "partial redirect" on the purl.org service (formerly maintained by OCLC, now by the Internet Archive) for persistent URLs, which can be redirected to another application on another server in the future. In addition, the identifier and URL format was extended to include subject and ware folders, which are defined by a combination of two keys, one for the country and another for the topic. The format of the links in Wikidata is controlled by a regular expression, which covers all four archives mentioned above. That works pretty well (very few format errors have occurred so far), and it saved us from creating four different archive-specific properties.
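To illustrate how a single regular expression can cover all four archives, here is a minimal Python sketch. The pattern is an assumption made for illustration, not the constraint actually stored on the property: it supposes that person (pe) and company (co) folders carry one six-digit key, and that subject (sh) and ware (wa) folders carry two six-digit keys separated by a comma. The authoritative format is the one defined on the Wikidata property page.

```python
import re

# Hypothetical pattern, for illustration only.
# Assumed: "pe"/"co" IDs have one six-digit key,
# "sh"/"wa" IDs have two six-digit keys (country, topic) joined by a comma.
FOLDER_ID = re.compile(r"(?:pe|co)/\d{6}|(?:sh|wa)/\d{6},\d{6}")


def is_valid_folder_id(value: str) -> bool:
    """Return True if the whole string matches the assumed ID format."""
    return FOLDER_ID.fullmatch(value) is not None
```

With one pattern of this kind, a single property can reject malformed values for every archive, which is what makes four archive-specific properties unnecessary.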

Shortly after the property creation, Magnus Manske, the author of the original MediaWiki software and lots of related tools, scraped our web site and created a Mix-n-Match catalog from it. During the following two years, more than 60 Wikidata users contributed to matching Wikidata items for humans to PM20 folder IDs.

For a start, deriving links from GND

Many of the PM20 person and company folders were already identified by an identifier from the German Integrated Authority File (GND). So, our first step was creating PM20 links for all Wikidata items which had matching GND IDs. For all these items and folders, disambiguation had already taken place, and we could safely add all these links automatically.

Infrastructure: PM20 endpoint, federated queries and QuickStatements

To make this work, we relied heavily on Linked Data technologies. A PM20 SPARQL endpoint had already been set up for our contribution to Coding da Vinci (a "Kultur-Hackathon" in Germany). Almost all automated changes we made to Wikidata are based on federated queries, either on our own endpoint, reaching out to the Wikidata endpoint, or vice versa, from Wikidata to PM20. In the latter case, the external endpoint has to be registered at Wikidata. Wikidata maintains a help page for this type of query.

For our purposes, federated queries allow extracting current data from both endpoints. In the missing_pm20_id_via_gnd.rq query used for the GND-based links, for example, this lets us skip all items where a link to PM20 already exists.
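The logic of that query can be mimicked in plain Python. The following is only a sketch of the idea, not the query itself: the function name, the dictionary keys and all sample identifiers are made up for illustration. In the real workflow, both inputs come live from the two SPARQL endpoints.

```python
def missing_pm20_links(pm20_by_gnd, wikidata_items):
    """Derive (QID, PM20 folder ID) pairs for items that lack a PM20 link.

    pm20_by_gnd:    dict mapping GND ID -> PM20 folder ID (from the PM20 side)
    wikidata_items: iterable of dicts with keys "qid", "gnd" and "pm20",
                    where "pm20" is None if the item has no PM20 link yet
    """
    result = []
    for item in wikidata_items:
        if item.get("pm20"):
            continue  # a link to PM20 already exists: skip this item
        folder_id = pm20_by_gnd.get(item.get("gnd"))
        if folder_id:
            result.append((item["qid"], folder_id))
    return result


# Invented sample data, not real identifiers.
pm20 = {"gnd-A": "pe/000001", "gnd-B": "pe/000002"}
items = [
    {"qid": "Q1", "gnd": "gnd-A", "pm20": None},         # link can be added
    {"qid": "Q2", "gnd": "gnd-B", "pm20": "pe/000002"},  # already linked
    {"qid": "Q3", "gnd": "gnd-C", "pm20": None},         # no matching folder
]
new_links = missing_pm20_links(pm20, items)
```

Because both sides are read at query time, rerunning the derivation after an import yields only links that are still missing.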

Within the query itself, we create a statement string which we can feed into the QuickStatements tool. That includes, for every single statement, a reference to PM20 with link to the actual folder, so that the provenance of these statements is always clear and traceable. Via script, a statement file is extracted and saved with a timestamp. Data imports via QuickStatements are executed in batch mode, and an activity log keeps track of all data imports and other activities related to PM20.
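As a rough sketch of what such a statement string and timestamped file name might look like, consider the following Python helpers. They assume the tab-separated QuickStatements V1 syntax, in which reference properties are written with an "S" prefix (S248 for "stated in", S854 for "reference URL"). The function names, the file naming scheme, the source item and the URL base are hypothetical placeholders, not the values used in our scripts.

```python
from datetime import datetime, timezone


def quickstatement(qid, folder_id, source_item, folder_url_base):
    """Build one QuickStatements V1 line: a PM20 folder ID (P4293) statement
    with a reference consisting of "stated in" and "reference URL"."""
    url = folder_url_base + folder_id
    return "\t".join([
        qid, "P4293", f'"{folder_id}"',  # the statement itself
        "S248", source_item,             # reference: stated in <source item>
        "S854", f'"{url}"',              # reference: URL of the actual folder
    ])


def statement_filename(prefix, now=None):
    """Return a file name with a timestamp (hypothetical naming scheme)."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}_{now:%Y%m%d_%H%M%S}.qs"


# Placeholder values: "Q1" and the example.org base are not real sources.
line = quickstatement("Q42", "pe/000012", "Q1", "https://example.org/folder/")
```

Keeping the reference in the same line as the statement means that provenance cannot get lost between the query result and the import.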

Creating missing items

After about 93% of the person folders which include free documents had been matched in Mix-n-Match, and after some efforts to discover more pre-existing Wikidata items, we decided to create the 346 missing person items, again via QuickStatements input. For better disambiguation of the newly created items, we filled the description field in Wikidata with the content of the free-text "occupation" field in PM20. (Here is a rather minimal example of such an item created from PM20 metadata.) Thus, all PM20 person folders which have digitized content were linked to Wikidata in June 2019.

Supplementing Wikidata with PM20 metadata

A second part of the integration of PM20 metadata into Wikidata was the import of missing property values to the corresponding items. This ranged from simple facts like "date of birth/death", through occupations such as "economist", "business economist", "social scientist", "earth scientist", which we could derive from the "field of activity" in PM20, up to relations between existing items, e.g. from a family member to the corresponding family, or from a board member to the corresponding company. A few other source properties have been postponed, because alternative solutions exist, and the best one may depend on the intended use in future applications. The steps of this enrichment process and links to the code used - including the automatic generation of references - are online, too.

Addition to Wikidata item "Friedrich Krupp AG" (Q679201) from PM20 metadata

Complex statement added to Wikidata item for Friedrich Krupp AG

Again, we used federated queries. Often the target of a Wikidata property is itself an item. Sometimes we could get this item directly via the target's PM20 folder ID (families, companies); sometimes we had to create lookup tables. For the latter, we used "values" clauses in the query (in the case of "occupation"), or we had to match countries from our internal classification in advance (in the case of "country of citizenship"), a process for which we use OpenRefine. Unlike PM20 folder IDs, which we avoided adding when folders do not contain digitized content, we added the metadata to all items which were linked to PM20, and we intend to repeat this process periodically as more items (e.g., companies) are identified by PM20 folder IDs. As a housekeeping activity, we also periodically add the numbers of documents (online and total) and the exact folder names as qualifiers to newly emerging PM20 links in items.
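A lookup table of this kind can be embedded in a SPARQL query as a VALUES block. The Python sketch below generates such a block from a dictionary. It is an illustration under assumptions: the helper name, the variable names and the item IDs in the sample table are placeholders, not the mappings we actually used.

```python
def values_clause(key_var, item_var, lookup):
    """Render a dict (label -> Wikidata QID) as a two-column SPARQL VALUES block."""
    def quote(text):
        # Escape backslashes and double quotes for a SPARQL string literal.
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

    rows = "\n".join(
        f"  ({quote(label)} wd:{qid})" for label, qid in sorted(lookup.items())
    )
    return f"VALUES (?{key_var} ?{item_var}) {{\n{rows}\n}}"


# Placeholder QIDs, for illustration only.
occupations = {"economist": "Q1", "social scientist": "Q2"}
clause = values_clause("activity", "occupation", occupations)
```

Inside the query, joining on the first variable then binds the second one to the target item, so each "field of activity" value found in PM20 resolves to a Wikidata item without a separate lookup step.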

Results of the data donation so far

With all 5266 person folders with digitized documents linked to Wikidata, the data donation of the person folders metadata is completed. Besides the folder links, which have already been used heavily to create links in Wikipedia articles, we now have

- more than 6000 statements which are sourced in PM20 (from "date of birth" to the track gauge of a Brazilian railway line)

- more than 1000 items, for which PM20 ID is the only external identifier

The data donation will be presented at WikidataCon in Berlin (24.-26.10.2019) as a "birthday present" on the occasion of Wikidata's seventh birthday. ZBW will continue to keep the digital content available, supplemented by a static landing page for every folder, which will also serve as the source link for the metadata we have integrated into Wikidata. But in the future, Wikidata will be the primary access path to our data, providing further metadata in multiple languages and links to a plethora of other external sources. And best of all, unlike with our current application, everybody will be able to enhance this open data through the interactive tools and data interfaces provided by Wikidata.

Participate in WikiProject 20th Century Press Archives

For the topics, wares and companies archives, there is still a long way to go. The best structure for representing these archives and their folders - often defined by the combination of a country within a geographical hierarchy with a subject heading in a deeply nested topic classification - has yet to be figured out. Existing items have to be matched, and lots of other work remains to be done. Therefore, we have created the WikiProject 20th Century Press Archives in Wikidata to keep track of discussions and decisions, and to create a focal point for participation. Everybody on Wikidata is invited to participate - or just kibitz. Taking part in the mapping of one of these systems to the emerging Wikidata knowledge graph could be a particular challenge for information scientists and for people interested in historic systems for the organization of knowledge about the whole world.
