Rich Interlinked Text Specification (proposal)

2025-06-07

Richard Palmer

I think this is an issue unresolved in all cultural heritage collection websites (but please let me know if otherwise) and yet it's such a small basic web feature - the ability to link to other object pages in a collection from the text within an object page. For example in the V&A collection a record's description might say:

This design is closely related to those of contemporary orange-houses, with six bays in each range. It relates closely to Plan 2 (E.419-1951).

(an example V&A object)

But that reference (E.419-1951) to the other object in the V&A's collection is not helpfully linked through to the other object page, so the user would have to start a new search in the collection site to find it, which seems a poor experience. Of course there are usually other fields in the record where direct links between objects (or other relationships, such as to controlled vocabulary terms for faceted search) can be generated, so why not within the text?

Essentially it's a systems pipeline problem (I feel sure there is a much better word to describe this situation, please let me know!). The collection management system where curators and cataloguers write the object information is often not the same system that makes the records available as web pages; instead, the data from one system is passed on and transformed into HTML by a second system. This means the different fields in the object record need to be transformed into HTML, and how this is handled depends on the type of field: mainly free-text fields or controlled vocabulary fields (with other minor variants I'm ignoring for this post). If it's a controlled vocabulary field, based on knowing:

the field type (that it is a controlled vocabulary managed field)
the field meaning (e.g. is it a controlled vocabulary for materials, or places, or artist/makers, etc)
the field value (as the controlled vocabulary identifier, e.g. AAT300045514)
the field value (as the controlled vocabulary displayable name, e.g. Jet)

when generating the HTML for this field, we can show the displayable name ("Jet") in the page of course, but we can also generate the URL to link to a faceted search for that material (or other controlled vocabulary term). The URL will vary depending on the website's structure and architecture, but it would be something along the lines of:

/search?[controlled vocabulary]=[controlled vocabulary term identifier]

or less abstractly, for the V&A Collection site:

/search?id_material=AAT300045514

So then the user doesn't have to initiate the search themselves (and this avoids the user doing a general text search for 'Jet', which would return objects using the word in multiple meanings: jet engine, jet set, etc.).
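As a minimal sketch of that step (not how the V&A site actually does it), the page generator could map the field's meaning to a search parameter and build the link from the identifier and display name. The function name facet_link and the FACET_PARAMS table are hypothetical; only the id_material parameter and the Jet identifier come from the example above.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical mapping from a field's meaning to its search parameter.
# Only "id_material" appears in the post; "id_place" is invented for illustration.
FACET_PARAMS = {
    "material": "id_material",
    "place": "id_place",
}


def facet_link(field_meaning: str, term_id: str, display_name: str) -> str:
    """Return an HTML link to a faceted search for one controlled vocabulary term."""
    param = FACET_PARAMS[field_meaning]
    url = "/search?" + urlencode({param: term_id})
    # Show the human-readable name, link on the machine-readable identifier.
    return f'<a href="{escape(url)}">{escape(display_name)}</a>'


link = facet_link("material", "AAT300045514", "Jet")
# link == '<a href="/search?id_material=AAT300045514">Jet</a>'
```

The point is simply that everything needed to build the URL is already carried by the field itself.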

But for free-text fields the situation is different. Going back to the first example:

This design is closely related to those of contemporary orange-houses, with six bays in each range. It relates closely to Plan 2 (E.419-1951).

There is no information that can be used to generate a URL automatically from the quoted museum accession number appearing within the text, primarily because we don't even know it is a museum accession number. We could write some regular expression rules to try to identify this, but (certainly for the V&A) there are many different ways these numbers can be written, sometimes just appearing as a single number, so we would likely create many erroneous links if we turned every instance of a number in a free-text field into a link to an object page.
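To illustrate the problem, here is a deliberately naive pattern run against the example sentence. The pattern is an invented guess at what such a rule might look like (optional letter prefix, digits, optional year suffix), not a real V&A rule. It has to accept bare numbers because some accession numbers are written that way.

```python
import re

# Invented, deliberately loose pattern: optional "E."-style prefix,
# some digits, and an optional "-1951"-style year suffix.
NAIVE_ACCESSION = re.compile(r"\b(?:[A-Z]{1,4}\.)?\d+(?:-\d{4})?\b")


def find_candidates(text: str) -> list[str]:
    """Return every substring the naive pattern would turn into a link."""
    return NAIVE_ACCESSION.findall(text)


text = (
    "This design is closely related to those of contemporary orange-houses, "
    "with six bays in each range. It relates closely to Plan 2 (E.419-1951)."
)
candidates = find_candidates(text)
# candidates == ['2', 'E.419-1951']
```

The "2" in "Plan 2" is picked up alongside the real accession number, which is exactly the kind of erroneous link described above.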

Also, somewhat unpleasingly to a tidy mind, even if we did turn the museum accession number into a link, we couldn't actually link directly to the object page; we would just have to link to a text search for it. This is because our URLs use the collection management system identifier for the object record, whereas the museum accession number identifies the physical object (for reasons too long to get into in this already too long post). So the link would just be:

/search?q=E.419-1951

rather than:

/item/O195744

which again would seem less than ideal (although admittedly not too terrible a crime, as presumably the search would return the object as well, but with one extra click for the user each time to reach it).

Bad proposal

A simple solution would be for curators and cataloguers to enter the URL for the related object into the object record directly like so:

<a href="http://example.org/object/O1234">A.123</a>

While this works, it hardcodes into the object record information about a different system (the website), which may change over time and might cause the URLs to break (obviously this should not happen, and redirects should be put in place to handle it, but still, it seems bad to hardcode architectural assumptions about one system into another system). It also requires the user to write more HTML tags correctly by hand (unless a collection management system vendor implements something in their text editor to handle this), which creates the risk of tags not being closed, breaking the object page.

Better(?) Proposal

Fundamentally, then, the issue is that some (system architecture?) knowledge is not passed on between the two systems in the pipeline: knowledge which would tell the object page generator system how it could handle museum accession numbers within free text in some better way. To try to avoid creating some complex new standard for resolving this, and because HTML typographic tags are often used to pass on instructions about how text should be shown (e.g. for italics or bold), a proposal would be to use the <data> HTML tag:

"The <data> HTML element links a given piece of content with a machine-readable translation."

(to quote from MDN - https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/data)

So it feels like a reasonable usage: we could include the object record identifier (for generating the URL) alongside the museum accession number (for humans to read). The free-text field would then be written as:

<data value="O1234">A.123</data>

This would have no impact on a page generation system that doesn't handle it (as the data element doesn't alter the text's presentation, although potentially the whole HTML could appear in raw form as above if HTML tags are not parsed/removed). But an object page generation system that did know how to handle this could turn it into a direct link to the associated object record.
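A minimal sketch of what the page generation side might do, assuming the /item/ path from the earlier example and a simple attribute layout (exactly one value attribute, double-quoted). The function name link_data_elements is hypothetical. A production system would more likely use a real HTML parser than a regular expression.

```python
import re
from urllib.parse import quote

# Assumes the element is written exactly as proposed: <data value="...">text</data>
DATA_TAG = re.compile(r'<data\s+value="([^"]+)"\s*>(.*?)</data>', re.DOTALL)


def link_data_elements(html_text: str, item_path: str = "/item/") -> str:
    """Replace each <data value="ID">label</data> with a link to the object page."""

    def to_link(match: re.Match) -> str:
        record_id, label = match.group(1), match.group(2)
        # The record identifier builds the URL; the accession number stays visible.
        return f'<a href="{item_path}{quote(record_id)}">{label}</a>'

    return DATA_TAG.sub(to_link, html_text)


field = 'It relates closely to Plan 2 (<data value="O195744">E.419-1951</data>).'
page_html = link_data_elements(field)
# page_html == 'It relates closely to Plan 2 (<a href="/item/O195744">E.419-1951</a>).'
```

Text without a data element passes through unchanged, so existing records would not need to be touched.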

Admittedly, this has the same issue as writing in the HTML link directly: it requires the curator/cataloguer to type in an HTML tag correctly (unless the collection management system vendor implements something in the text editor to insert it for the user). But this time they are not hardcoding a URL for another system; instead they are just putting in the identifier for the corresponding object record, in the same system in which they are writing the current object record.

Extending further

Taking this further, the data element could also have some custom attributes (using the data- prefix), which would allow a variety of different links to be generated in the page building system, for example:

links to a controlled vocabulary faceted search (perhaps data-controlled-field="materials" to indicate a link to the materials facet)
links to other cataloguing systems such as the library or archive catalogue
links to map co-ordinates
links to other institutions' systems (but then how do we know how to build the right URL for the other institution?)
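As a rough sketch of how a page building system might choose a link type from those attributes, the function below takes the attributes of one data element as a dictionary. The function name url_for, the FACET_PARAMS table and the rule of leaving unknown fields unlinked are all assumptions made for illustration; only data-controlled-field="materials", id_material and the /item/ path come from the examples above.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Hypothetical facet names mapped to search parameters.
# Only the materials / id_material pairing is taken from the post.
FACET_PARAMS = {"materials": "id_material"}


def url_for(attrs: dict) -> Optional[str]:
    """Pick a URL for one <data> element, or None if it should stay unlinked."""
    value = attrs.get("value")
    if not value:
        return None
    facet = attrs.get("data-controlled-field")
    if facet is None:
        # No custom attribute: treat the value as an object record identifier.
        return "/item/" + quote(value)
    if facet in FACET_PARAMS:
        # Known controlled vocabulary: link to the faceted search.
        return "/search?" + urlencode({FACET_PARAMS[facet]: value})
    # Unrecognised attribute value: safer to show plain text than guess a URL.
    return None


object_url = url_for({"value": "O1234"})
facet_url = url_for({"value": "AAT300045514", "data-controlled-field": "materials"})
# object_url == '/item/O1234'
# facet_url == '/search?id_material=AAT300045514'
```

Each new link type would mean another attribute and another branch here.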

But perhaps that is something that needs more standardisation, to avoid a mess of different custom attributes across systems. Possibly an area for an organisation like Collections Trust to take up? (because there are no other problems around of course!)