Sanitise document IDs on import

A suggestion:

Currently, document IDs are taken literally from the metadata document or filename without further processing. This can lead to the inclusion of characters that are troublesome in URLs, e.g. [?&=%]. These could be escaped (and they are), but we recently encountered an issue (on LAT Trac) with double escaping as a result of URL rewriting. Issues of this kind are hard to predict, so it is best to keep things simple and strip out and/or encode all such characters in a URL-friendly way (which does not mean URL-encoding them!).

An example case is: ;jsessionid=9D091DE27DA0F95A0724A6C646C7E641?wicket:bookmarkablePage=:eu.clarin.cmdi.vlo.pages.ShowResultPage&q=Pe&fq=language:English&docId=hdl:1839/00-0000-0000-0008-3C6E-A@format%253Dcmdi

where the docId is hdl:1839/00-0000-0000-0008-3C6E-A@format=cmdi. Its "=" shows up in the URL as %253D, i.e. %3D with the percent sign escaped a second time. MPI-PL by policy provides such handles for metadata documents that are converted to CMDI on the fly.
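To make the double escaping concrete, here is a minimal Python sketch that reproduces the effect on the docId above. It only uses the standard library; the choice of `safe` characters is an assumption made so that the output matches the URL in this ticket, and it is not taken from the VLO code or the rewriting setup.

```python
from urllib.parse import quote, unquote

doc_id = "hdl:1839/00-0000-0000-0008-3C6E-A@format=cmdi"

# First escape: "=" becomes "%3D".
# Assumption: ":", "/" and "@" are left alone, as in the URL in this ticket.
once = quote(doc_id, safe=":/@")

# Second escape (e.g. by a URL rewriting layer): "%" itself becomes "%25".
twice = quote(once, safe=":/@")

print(once)   # hdl:1839/00-0000-0000-0008-3C6E-A@format%3Dcmdi
print(twice)  # hdl:1839/00-0000-0000-0008-3C6E-A@format%253Dcmdi

# A receiver that decodes only once ends up with the escaped form,
# not the original docId.
print(unquote(twice) == doc_id)  # False
```

Whether the receiving side decodes once or twice depends on the layers in between, which is what makes this hard to predict.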

Fixed in branch vlo-3.0 (r4611: Fix ticket #490: "Sanitise document ID's on import" -> replacing problematic characters with their ASCII code in underscores)
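For illustration, the replacement described in the commit message could look roughly like the Python sketch below. This is not the code from r4611: the function name `sanitise_doc_id`, the list of problematic characters (taken from the examples in this ticket) and the use of the decimal character code are all assumptions.

```python
# Assumption: the characters named in this ticket; the real list may differ.
PROBLEMATIC = "?&=%"


def sanitise_doc_id(doc_id: str, problematic: str = PROBLEMATIC) -> str:
    """Replace each problematic character with _<decimal code>_."""
    return "".join(
        f"_{ord(ch)}_" if ch in problematic else ch
        for ch in doc_id
    )


print(sanitise_doc_id("hdl:1839/00-0000-0000-0008-3C6E-A@format=cmdi"))
# hdl:1839/00-0000-0000-0008-3C6E-A@format_61_cmdi
```

Note that a mapping like this is one-way in general: an ID that already contains a sequence such as "_61_" cannot be told apart from a sanitised "=", so the sanitised ID cannot be relied on to recover the original.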

comment:4 Changed 10 years ago by Twan Goosen

This change caused #542. The following strikes me as the best solution:

(...) add a field 'selfLink' which holds the original, actual self link (which is the source of the doc ID if it is present) and use this field to populate the JSON request to the aggregator.

Edit: this has been implemented. The MdSelfLink header is now also mapped to a facet "_selfLink"; see #542 for details.

