On 5 March 2016 we are organizing an Open Data Day on Amsterdam in the Golden Age. The meeting will take place in the Amsterdam Museum. The Open Data Day is a workshop and codefest that is open to everyone. More information (in Dutch) on:
Of the twelve thousand creators and depicted persons in the Amsterdam Museum collection, we were able to link 465 to one of the twenty thousand biographies in the Ecartico database. This post describes the way we did it.
First, 465 doesn’t sound like an awful lot, considering the twelve thousand persons on the one side and more than twenty thousand on the other. Please keep in mind that the Amsterdam Museum collection spans more than five centuries, while Ecartico focuses on artists working in 17th-century Amsterdam. By nature, false negatives are hard to find, but the fact that we couldn’t think of a single person who should’ve been matched and wasn’t suggests we didn’t do a bad job.
The fields to match on were: name, date of birth, date of death and RKD URI (more on RKD URIs in a previous post). We considered ‘profession’ as well, but at the Amsterdam Museum this field was left empty most of the time and mapping ‘persons that created objects tagged as silverware’ to the term ‘silversmith’ would have been time-consuming.
Not knowing which combinations would yield the best results, we decided to match on each of these fields separately and save the results in a matrix. So people who were born in the same year and died in the same year scored a ‘1’ in the birth and death year column of the matrix, regardless of their names.
The RKD URI proved to be the best, and only, one-field-matcher. It leaves no room for ambiguity, so only human error might result in mistakes here.
Matching persons on their exact names yielded a considerable number of false positives because there’s more than one Jan Jansen and some fathers, like Romeyn de Hooghe, named their sons after themselves. The number of false negatives was much higher – ‘Rembrandt’ just wouldn’t match with ‘Rembrandt Harmensz. van Rijn’.
Obviously, on a fuzzy name search, the number of false negatives was much lower, but the number of false positives skyrocketed into the thousands. However, the combination of a fuzzy-name-match and a birth-and-death-year-match (just years, not exact dates, since many records just had years) did the job very well. We were able to identify just one false positive: a Jan Claesz (1570-1618) matching Nicolaes Jansz. Wytmans, Claes Wijtmans (1570-1618).
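The matrix approach can be sketched in a few lines of Python. This is only an illustration, not the code we ran: the record layout (`name`, `born`, `died`, `rkd`), the token-based fuzzy comparison and the 0.85 threshold are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def tokens(name):
    """Lower-case a name and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", name.lower())


def fuzzy_name_match(a, b, threshold=0.85):
    """Assumed fuzzy rule: any token of one name closely resembles any token of the other."""
    return any(
        SequenceMatcher(None, x, y).ratio() >= threshold
        for x in tokens(a)
        for y in tokens(b)
    )


def match_row(p, q):
    """One row of the matrix: a 0/1 score per field, each field scored on its own."""
    same_years = (
        p.get("born") is not None and p.get("born") == q.get("born")
        and p.get("died") is not None and p.get("died") == q.get("died")
    )
    same_uri = p.get("rkd") is not None and p.get("rkd") == q.get("rkd")
    return {
        "exact_name": int(p["name"].lower() == q["name"].lower()),
        "fuzzy_name": int(fuzzy_name_match(p["name"], q["name"])),
        "years": int(same_years),
        "rkd_uri": int(same_uri),
    }


def is_match(row):
    """Accept on RKD URI alone, or on a name match combined with matching years."""
    name_hit = row["fuzzy_name"] or row["exact_name"]
    return bool(row["rkd_uri"] or (row["years"] and name_hit))


# Hypothetical sample records, typed in by hand for this example.
museum = {"name": "Rembrandt", "born": 1606, "died": 1669}
ecartico = {"name": "Rembrandt Harmensz. van Rijn", "born": 1606, "died": 1669}
print(match_row(museum, ecartico), is_match(match_row(museum, ecartico)))
```

A token-level rule this loose also explains the one false positive mentioned above: ‘Claesz’ and ‘Claes’ are close enough to count as a name hit, and the years happen to coincide.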
The final results:
- 280 matches on RKD URI alone
- 181 additional matches on the fuzzy-name-match and a birth-and-death-year-match combination
- 3 additional matches on an exact-name-match and a birth-and-death-year-match combination (implies a slight error in the fuzzy search!)
- 1 manual match: `Neeltje Willemsdr. van Zuijdtbrouck` marked as matching `Rembrandt’s moeder`.
On Wednesday 14 January 2015 we are organizing an expert meeting about “biographical augmentation of collection databases”.
Within the CANAAN project we investigate the possibilities of augmenting a museum collection database with (structured) biographical data from external resources. As several posts on this blog testify, this is not as straightforward as it might seem. In the workshop we will present the results we have achieved so far and an overview of the challenges we have encountered. Then we will open the floor for a general discussion about the perils, possibilities and prospects of linking collection databases to external resources.
This workshop is targeted at specialists in the field of collection databases, biographical databases and digitization in the cultural heritage domain. If you are interested in participating, please send an email to firstname.lastname@example.org.
Location: Amsterdam Museum
Everyone who has ever tried to find a certain person across different datasets knows names are not always the best thing to identify a person with. Different persons go by the same name, one person can go by different names and even the same name can be written in numerous ways: ‘Rembrandt van Rijn’ = ‘Rijn, Rembrandt van’ = ‘Rhijn, Rembrandt Harmensz. van’ = ‘Rembrant’.
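To make the problem concrete, here is a small Python sketch of the only part that is easy to automate: undoing the ‘Surname, Given names’ inversion. The function name and the normalization rule are assumptions for this illustration, not something either database provides.

```python
import re


def normalise_name(name):
    """Turn 'Surname, Given names' into 'given names surname', lower-cased."""
    if "," in name:
        surname, given = name.split(",", 1)
        name = f"{given} {surname}"
    # Keep word characters only, so stray punctuation and spacing don't matter.
    return " ".join(re.findall(r"\w+", name.lower()))


variants = [
    "Rembrandt van Rijn",
    "Rijn, Rembrandt van",
    "Rhijn, Rembrandt Harmensz. van",
    "Rembrant",
]
for variant in variants:
    print(variant, "->", normalise_name(variant))
```

Only the first two variants end up identical. Spelling differences (‘Rijn’ versus ‘Rhijn’), patronyms and shortened names survive this kind of clean-up, which is why an identifier is a safer key than a name.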
To address this problem the Amsterdam Museum uses the exact spelling the RKD (Rijksbureau voor Kunsthistorische Documentatie) uses. As do other collection owners, such as the Rijksmuseum. The Ecartico database, regrettably, does not. So that is not going to help us if we want to trace persons across these two datasets. However, in Ecartico, almost 1500 persons are linked to their RKDartists URI.
By hand, the Amsterdam Museum already linked 686 persons to their RKDartists URI. And just a month ago, the RKD launched an OpenSearch interface, that made it possible to automatically get the RKDartists URIs for another 2108 persons. As yet, the RKD OpenSearch doesn’t seem to be documented, but if you want to search on artist names a query like https://rkd.nl/nl/opensearch-eac-cpf?q=naamdeel:rembrandt&startIndex=1&count=10 will probably get you what you want.
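As a hedged sketch, the query URL above can be built in Python like this. Only the endpoint and the three parameters come from the example URL; the function name is made up here, and since the interface is undocumented the sketch stops at constructing the URL and does not try to parse a response.

```python
from urllib.parse import urlencode

# Endpoint taken from the example query above.
RKD_OPENSEARCH = "https://rkd.nl/nl/opensearch-eac-cpf"


def rkd_query_url(name_part, start_index=1, count=10):
    """Build a search URL for (part of) an artist name."""
    params = {
        "q": f"naamdeel:{name_part}",
        "startIndex": start_index,
        "count": count,
    }
    # Keep ':' unescaped so the result looks like the hand-written example.
    return f"{RKD_OPENSEARCH}?{urlencode(params, safe=':')}"


print(rkd_query_url("rembrandt"))
```

Raising `start_index` in steps of `count` would be the obvious way to page through a longer result list, assuming the interface follows the usual OpenSearch conventions.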
There were some false negatives: for 81 persons who had been given an RKDartists URI by the museum staff, we couldn’t find the exact name in the RKD OpenSearch. By the looks of it, most of these cases were caused by a different spelling of names.
There were some false positives as well, always occurring when different persons went by the same name. The birth and death dates of 1461 persons in the museum collection were checked against the RKDartists dates. Twenty-seven of them, or 1.8 percent, didn’t match.
The false positives are illustrated by Louis Le Comte, a 17th-century French Jesuit whose Beschryvinge van het machtige keyserryk China we find in the collection of the museum. In this case, museum staff didn’t give birth or death dates. In RKDartists, we find another Louis Le Comte – a 19th-century Dutch author and amateur artist working for the Royal Navy. You can’t expect museum staff to check author names in RKDartists, and you can’t blame the RKD for not knowing French Jesuit authors. Using URIs more often might be a solution.
The CANAAN project’s aim being to mash up data in a single research tool, we could have just taken dumps from both datasets and started building from there. However, to make the data more accessible, for both team members and other researchers, we started by building an API on the Ecartico database (please note the API is still in development).
With the API we especially want to accommodate the matching of persons across datasets. Searching for specific persons is possible by name (full name, surname, first name, patronym), gender, birth and death date, and birth and death place. The API returns these fields and the Ecartico URI for each person found. Saving the Ecartico URI should give you permanent access to up-to-date biographical data about a person (for now only available as HTML, but in a next phase as JSON and RDF as well).
Ecartico already matched a lot of people in its datasets with biographical data elsewhere (most notably in Biografisch Portaal and RKD Artists). Since these URIs are sometimes even better at identifying people than their actual names (should you search for ‘maas’ or ‘maes’ or ‘masius’?), the API makes it possible to search for URIs: http://www.vondel.humanities.uva.nl/ecartico/api/searchURI/?uri=https://rkd.nl/explore/artists/66219.
To get all persons linked to their RKD counterparts, search for domain: http://www.vondel.humanities.uva.nl/ecartico/api/searchURI/?domain=https://rkd.nl/explore/.
The Amsterdam Museum collection, managed in Adlib collection management software, is made available online through a web interface and through the Adlib API. As explained on the museum’s open data page, the collection is also available through OAI-PMH and as Linked Open Data, but these sources are not kept up-to-date (LOD based on a 2011 dump, OAI-PMH not updated after June 2013) and in the OAI-PMH some information has been lost while mapping fields to their Dublin Core equivalents.
In the web interface it’s possible to use a person’s name as search term, but there’s no such thing as a ‘person page’ as meant in ‘Getting Ready – presenting data in a usable way‘. The closest thing to that are the search results on an exact name (get there by searching on an exact name or clicking a person’s name in an item details page). In the results it is unclear whether an object is tagged with this person, made by this person or has a different relationship with this person. No further biographical data of the person is shown.
The Adlib API is somewhat more rewarding. The persons database has been made available (see http://amdata.adlibsoft.com/adlibweb.xml for all available databases and fields) and can be queried in different ways. Information about a single person can easily be retrieved once you know that person’s priref (id): http://amdata.adlibsoft.com/wwwopac.ashx?database=AMperson&search=priref=13978. Not knowing the priref you can search by name: http://amdata.adlibsoft.com/wwwopac.ashx?database=AMperson&search=name=rembrandt*.
Regrettably, the API returns just the bare essentials of a person: name, date of birth, date of death and some other fields. The URLs that lead to information about a person elsewhere (RKD, mostly) are not returned. More serious: the objects the person is linked to are omitted. The only way I have found so far to get the linked objects is to query the collection database and search for the exact name of the creator: http://amdata.adlibsoft.com/wwwopac.ashx?database=AMcollect&search=creator=%27Elsken,%20Ed%20van%20der%27 (don’t forget to put quotes around search strings containing spaces or commas and to URL-encode the string).
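The quoting and encoding rule is easy to get wrong by hand, so here is a minimal Python sketch that builds the three kinds of query URL shown above. The helper name `adlib_search_url` is made up for this example; only the endpoint, the database names and the field names come from the URLs in this post.

```python
from urllib.parse import quote

# Endpoint taken from the example URLs above.
ADLIB_BASE = "http://amdata.adlibsoft.com/wwwopac.ashx"


def adlib_search_url(database, field, value):
    """Build a query URL, quoting and URL-encoding the search value where needed."""
    value = str(value)
    # Values containing spaces or commas must be wrapped in single quotes.
    if " " in value or "," in value:
        value = f"'{value}'"
    # Leave '=', ',' and the '*' wildcard readable; encode quotes and spaces.
    expression = quote(f"{field}={value}", safe="=,*")
    return f"{ADLIB_BASE}?database={database}&search={expression}"


print(adlib_search_url("AMperson", "priref", 13978))
print(adlib_search_url("AMperson", "name", "rembrandt*"))
print(adlib_search_url("AMcollect", "creator", "Elsken, Ed van der"))
```

The three printed URLs are the same as the hand-written ones in the text. Names that themselves contain a single quote are not handled by this sketch.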
Persons in the Amsterdam Museum collection – a prototype
We made a prototype of the kind of person page envisioned in ‘Getting Ready – presenting data in a usable way‘. The data was extracted from the Adlib API (we retrieved all object records to get the – unique – names of persons mentioned as creator or as subject) and from an XML export of the persons database the museum sent us by email. For the latter we could have used the Adlib API to retrieve all persons, but then we wouldn’t have had the URLs to information about persons elsewhere. And these URLs are important to us, because they can be used as identifiers when names are not unique within a source or not the same across sources.
In the prototyped application all information about a person is gathered on a single page: biographical data, urls to information about the person elsewhere, objects the person created and objects that were tagged with the person as subject. The person page has its own permanent URL that can serve as URI. In the future information about this person from other sources (Wikipedia, RKD, etc.) could be shown here as well. As could representations of the data in rdf or json.
The Ecartico database holds biographical data concerning painters, engravers, printers, book sellers, gold- and silversmiths and others involved in the ‘cultural industries’ of the Low Countries in the sixteenth and seventeenth centuries. The Amsterdam Museum knows a lot about the objects in its collection, including who created those objects and who is portrayed on them. Matching persons across the two datasets will make it possible to answer questions like “what was the religion of painters painting schuttersstukken?”
To make this matching (and linking, if matched) possible, both institutions should present their data in a usable and accessible way – both for humans and machines. In doing so, other researchers and developers will be able to use the data for their own purposes as well.
In the ideal world, as shown in fig. 1:
- each person, in both datasets, is presented on its own URI
- each person, in both datasets, is presented in a human-readable (html) and machine-readable (json, rdf) format
- both datasets have a search-api to find persons by name, date of birth, etc.
- since URIs are good at identifying things, the search APIs will let you find persons by URI as well
- the museum has a search-api for objects as well.
Of course, the ideal world has yet to materialize. In a next post more on the current way persons are presented in the Amsterdam Museum dataset.