Integrating the PM20 companies archive: part 3 of the data donation to Wikidata

ZBW inherited a large trove of historical company information - annual reports, newspaper clippings and other material about more than 40,000 companies and other organizations around the world. Parts of it, in particular everything about German and British entities up to 1949, are freely available online in the companies section (list by country) of the 20th Century Press Archives. More digitized folders with material about companies in and outside of Europe up to 1960 are accessible only on ZBW premises, due to intellectual property rights.

As a part of its support for Open Science, ZBW has made all metadata of the 20th Century Press Archives available under a CC0 license. In order to make the folders more easily accessible for business history research as well as for the general public, we have added links for every single folder to Wikidata. In addition to that, the metadata about companies and organizations, such as inception date or links to board members, has been added to the large amount of company data already available in Wikidata. This continues the PM20 data donation of ZBW to Wikidata, as described earlier for the persons archives and the countries/subjects archives. The activities were carried out - with notable help of volunteers - and documented in the WikiProject 20th Century Press Archives.

The mapping process to Wikidata items

Many of the PM20 company and organization folders deal with entities that already have items in Wikidata. If GND identifiers were assigned to these items, we directly created links to the PM20 companies with the same ID, and were done. Matching and linking to Wikidata items without the help of a unique identifier, however, proved challenging. Unlike person names, company names change frequently, or are spelled differently at different times or in different languages. Not infrequently, the entities themselves change through mergers and acquisitions, and may or may not have been represented by a new folder in PM20, or by a different item in Wikidata. Subsidiaries may be subsumed under the parent organization, or be separate entities. While it is relatively easy to split items in Wikidata, in the folders with printed newspaper clippings and reports it meant digging through sometimes hundreds of pages to single out a company retrospectively. So early decisions about the cutting and delimitation of folders often stuck for the following decades. All of that made it more difficult not only to obtain matches at all, but also to decide whether a folder and an item indeed cover the same entity.
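The identifier-based step described at the start of this section can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the record layout (`folder_id`, `gnd`, `qid`), the function name, the placeholder item ID and the second folder ID are assumptions made for the example. Only the folder ID `co/068007` and its GND ID appear elsewhere in this post.

```python
def link_by_gnd(pm20_folders, wikidata_items):
    """Pair PM20 folders with Wikidata items that carry the same GND ID.

    Returns (links, ambiguous): unambiguous (folder_id, qid) pairs, and
    folders whose GND ID occurs on more than one item (left for review).
    """
    # Index the Wikidata items by GND ID; items without one are skipped.
    qids_by_gnd = {}
    for item in wikidata_items:
        gnd = item.get("gnd")
        if gnd:
            qids_by_gnd.setdefault(gnd, []).append(item["qid"])

    links, ambiguous = [], []
    for folder in pm20_folders:
        gnd = folder.get("gnd")
        qids = qids_by_gnd.get(gnd, []) if gnd else []
        if len(qids) == 1:
            links.append((folder["folder_id"], qids[0]))
        elif len(qids) > 1:
            ambiguous.append((folder["folder_id"], qids))
    return links, ambiguous


# Hypothetical sample data: "Q_PLACEHOLDER" is not a real item ID,
# and "co/000001" is an invented folder without a GND ID.
folders = [
    {"folder_id": "co/068007", "gnd": "2040532-7"},
    {"folder_id": "co/000001", "gnd": None},
]
items = [{"qid": "Q_PLACEHOLDER", "gnd": "2040532-7"}]
links, ambiguous = link_by_gnd(folders, items)
```

Folders without a GND ID, or without a counterpart carrying the same ID, simply fall through and remain for the name-based matching described next.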

For the first matching approach, we used Wikidata's Mix-n-match (M-n-m) tool. In order to get manageable "buckets", we sliced the data according to the main language of the company location (German, English, French, Dutch and Other). In the M-n-m batches, we aimed at entities of type "organization" in that language. Even though we also used the available aliases, we found relatively low matching rates (and a number of "false positives" among the matches).

After having worked through the M-n-m suggestions, we switched to another approach: For each segment, we created a list of search statements for all folders not already linked in Wikidata. For each entry, the company name was searched via Startpage (which in turn uses Google search), supplemented by "site:wikipedia.org". That searches the full text of all Wikipedia pages, so slight differences in the spelling of company names did not matter. Wikipedia pages also turned up where the company name only occurred in some context, e.g. in the article about the founder of the company or as part of a later merger. We could then select the correct Wikipedia page from the result list, follow the "Wikidata item" link and add the PM20 folder ID to the item. Another search link on the list searched "site:wikidata.org" for existing Wikidata items. (It turned out that for Wikidata, DuckDuckGo brought better results than Startpage/Google.)
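To illustrate how such a list of search links might be generated, here is a minimal sketch. It is not the query code used in the project. The Startpage endpoint and its `query` parameter are assumptions, as is the function name; the DuckDuckGo `q` parameter is the commonly used one.

```python
from urllib.parse import urlencode

# Assumed search endpoints; the Startpage URL and parameter name in
# particular are assumptions and may differ from what the project used.
STARTPAGE = "https://www.startpage.com/do/search"
DUCKDUCKGO = "https://duckduckgo.com/"


def search_links(company_name):
    """Return one Wikipedia and one Wikidata search link for a name.

    The name is deliberately not quoted, so that the search engine's
    full-text matching tolerates slight spelling differences.
    """
    wikipedia_query = f"{company_name} site:wikipedia.org"
    wikidata_query = f"{company_name} site:wikidata.org"
    return {
        "wikipedia": f"{STARTPAGE}?{urlencode({'query': wikipedia_query})}",
        "wikidata": f"{DUCKDUCKGO}?{urlencode({'q': wikidata_query})}",
    }


links = search_links("Steel Brothers & Company")
print(links["wikipedia"])
print(links["wikidata"])
```

`urlencode` takes care of escaping characters such as `&` and `:` that would otherwise break the URL.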

When no exactly matching item was found, we sometimes added the PM20 link with a "mapping relation type" of "related match", according to the perceived usefulness for later more detailed work.

The tedious work was facilitated by a second list, which contained statements for creating missing items immediately in Wikidata's QuickStatements (QS) tool. The statements included labels in different languages, descriptions, sometimes aliases, the official name (with the legal form), type(s) and often the GND ID, as in this example:

# Steel Brothers & Company {19}

CREATE
LAST|Lde|"Steel Brothers & Company"
LAST|Len|"Steel Brothers & Company"
LAST|Dde|"Unternehmen; Kolonialgesellschaft"
LAST|Den|"business; colonial society"
LAST|Ade|"W. Strang Steel & Co"
LAST|Aen|"W. Strang Steel & Co"
LAST|P4293|"co/068007"
LAST|P31|Q4830453|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
LAST|P31|Q1700154|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
LAST|P227|"2040532-7"|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
LAST|P571|+1870-01-01T00:00:00Z/9|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
LAST|P1448|de:"Steel Brothers & Company, Ltd."|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
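A conversion from a metadata record to statements like the ones above could look roughly like this. The sketch is not the project's actual conversion code: the record layout and the function name are assumptions, and names containing double quotes are not handled. The property, qualifier and item IDs are copied from the example above.

```python
# Item used as the S248 ("stated in") source in the example above.
PM20_SOURCE = "Q36948990"


def quickstatements(rec):
    """Build pipe-separated QuickStatements lines for one new item."""
    folder_id = rec["folder_id"]
    name = rec["official_name"]
    retrieved = rec["retrieved"]
    # Reference part appended to every sourced statement.
    ref = (f'S248|{PM20_SOURCE}|S4293|"{folder_id}"'
           f'|S1810|"{name}"|S813|{retrieved}')

    lines = ["CREATE"]
    # Labels (L), descriptions (D) and aliases (A), one line per language.
    for prefix, field in (("L", "labels"), ("D", "descriptions"), ("A", "aliases")):
        for lang, text in rec.get(field, {}).items():
            lines.append(f'LAST|{prefix}{lang}|"{text}"')

    lines.append(f'LAST|P4293|"{folder_id}"')
    for type_qid in rec.get("types", []):
        lines.append(f"LAST|P31|{type_qid}|{ref}")
    if rec.get("gnd"):
        gnd = rec["gnd"]
        lines.append(f'LAST|P227|"{gnd}"|{ref}')
    if rec.get("inception_year"):
        year = rec["inception_year"]
        # Precision 9 = year.
        lines.append(f"LAST|P571|+{year:04d}-01-01T00:00:00Z/9|{ref}")
    name_lang = rec["name_lang"]
    lines.append(f'LAST|P1448|{name_lang}:"{name}"|{ref}')
    return lines


record = {
    "folder_id": "co/068007",
    "labels": {"de": "Steel Brothers & Company", "en": "Steel Brothers & Company"},
    "descriptions": {"de": "Unternehmen; Kolonialgesellschaft",
                     "en": "business; colonial society"},
    "aliases": {"de": "W. Strang Steel & Co", "en": "W. Strang Steel & Co"},
    "types": ["Q4830453", "Q1700154"],
    "gnd": "2040532-7",
    "inception_year": 1870,
    "official_name": "Steel Brothers & Company, Ltd.",
    "name_lang": "de",
    "retrieved": "+2021-08-10T00:00:00Z/11",
}
lines = quickstatements(record)
print("\n".join(lines))
```

Building the reference part once and appending it to every sourced statement keeps the source, folder ID, name as stated and retrieval date consistent across the statements of an item.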


Since both lists followed exactly the same order (primarily by descending number of documents, to put the most relevant companies on top) and were updated every hour, the workflow was easy: step through the list, search for existing items and link them, add the remaining entries one by one or in batches to Wikidata via QS, and repeat until all entries are linked and both lists are empty. As a result, 3897 PM20 folders could be linked to existing items, while 5085 items were created from scratch. (Query code for the search list, the insert list, and the conversion to QS statements are available.)

Enriching the metadata

After the mapping process had been finished, we added missing metadata from PM20 to all linked Wikidata items. That included country and headquarters location (having Geonames identifiers in PM20 helped a lot), inception and dissolution dates, links to predecessor and parent companies, and links to persons in their role as founders or board members.

Table of organization properties sourced in PM20:

PID    Property                                 Pre-existing Items  New Items  Total
P452   industry                                               5509       7682  13191
P17    country                                                 769       5105   5874
P31    instance of                                             424       5371   5795
P1448  official name                                            94       5073   5167
P159   headquarters location                                   722       3542   4264
P571   inception                                               371       3800   4171
P227   GND ID                                                  816       1708   2524
P355   subsidiary                                              764        191    955
P749   parent organization                                     384        567    951
P576   dissolved, abolished or demolished date                 204        673    877
P156   followed by                                             331        424    755
P155   follows                                                 538        209    747
P3320  board member                                            460         35    495
P112   founded by                                               78         22    100
P5052  supervisory board member                                 63         20     83

Source

Classification by industry

The PM20 companies archive was organized by industries, in two different ways: firstly, a custom classification was used for all folders, derived from an ancient version of the "economic sectors" part of the STW Thesaurus for Economics. Secondly, parts of the folders were classified according to the European economic activities classification NACE Rev. 2.

Here, the approach was to map the custom classification to existing - and a few newly built - industry items in Wikidata (see mapping). This allowed us to fill the "industry" property of all linked Wikidata items with values derived from PM20. Additionally, further matching industries were derived from the "NACE code" property in Wikidata. Interestingly, this combined approach extended the coverage of company folders by NACE significantly - from 3,648 to 6,233.
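The two-step derivation can be pictured with a small sketch. It is a simplified illustration under stated assumptions, not the project's mapping: the function name, the classification codes, the industry item IDs and the folder IDs are all placeholders, and in practice the NACE codes would be read from the "NACE code" property of the mapped industry items in Wikidata.

```python
def derive_industries(folder_classes, class_to_industry, industry_to_nace):
    """Map each folder's custom class to an industry item, then to NACE.

    Folders whose class has no mapped industry item are left out; the
    NACE value is None where the industry item carries no NACE code.
    """
    result = {}
    for folder_id, class_code in folder_classes.items():
        industry = class_to_industry.get(class_code)
        if industry is None:
            continue
        result[folder_id] = {
            "industry": industry,
            "nace": industry_to_nace.get(industry),
        }
    return result


# All identifiers below are hypothetical placeholders.
folder_classes = {"co/A": "class-1", "co/B": "class-2", "co/C": "class-9"}
class_to_industry = {"class-1": "Q_INDUSTRY_1", "class-2": "Q_INDUSTRY_2"}
industry_to_nace = {"Q_INDUSTRY_1": "C24"}

derived = derive_industries(folder_classes, class_to_industry, industry_to_nace)
```

In this toy data, one folder gets both an industry and a NACE code, one gets only an industry, and one stays unclassified because its class has no mapping.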

Due to incompatibilities on the conceptual level, that could not be extended to all industries. To give an example: One of the most important industry sectors in Germany, "Metallindustrie" (metal industry), cannot be represented by a single NACE class: "C24 Manufacture of basic metals" is strictly separate from "C25 Manufacture of fabricated metal products", while further processing of metals, e.g. into machinery and equipment, is assigned to still other classes.

Supplemented with "plain" Wikidata industries, it nevertheless proved possible to create a complete hierarchical list of companies with PM20 folders by NACE code and, in the absence of a NACE code, by Wikidata industry label.
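The fallback logic of such a list can be sketched as follows. This is a minimal illustration only (the actual list is produced by a Wikidata query). It assumes NACE codes written with their section letter, as in "C24" above; the function name, the record layout and the company names are hypothetical.

```python
from collections import defaultdict

NO_NACE = "(no NACE code)"


def build_hierarchy(companies):
    """Group company names by NACE section and code, else by industry label."""
    tree = defaultdict(lambda: defaultdict(list))
    for company in companies:
        nace = company.get("nace")
        if nace:
            # First level: section letter; second level: the full code.
            top, leaf = nace[0], nace
        else:
            # Fallback: group by the Wikidata industry label instead.
            top, leaf = NO_NACE, company["industry_label"]
        tree[top][leaf].append(company["name"])

    # Sort NACE sections alphabetically and put the fallback group last.
    ordered_tops = sorted(tree, key=lambda top: (top == NO_NACE, top))
    return {
        top: {leaf: sorted(names) for leaf, names in sorted(tree[top].items())}
        for top in ordered_tops
    }


# Hypothetical sample records.
companies = [
    {"name": "Company B", "nace": "C25", "industry_label": "metal products"},
    {"name": "Company A", "nace": "C24", "industry_label": "basic metals"},
    {"name": "Company C", "nace": None, "industry_label": "metal industry"},
]
hierarchy = build_hierarchy(companies)
```

Keeping the label-based groups in a separate branch makes it visible which parts of the list rest on NACE and which only on Wikidata industries.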

Chart of PM20 industries (Wikidata query result with mouse-over labels)

As a result of this data donation, the coverage of 20th century companies and organizations in Wikidata has improved considerably, both in breadth and depth. With the links to the digitized PM20 folders, about 1.2 million document pages have been made available from the corresponding items for FAIR use in research, education and public information.