Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#490 closed enhancement (fixed)

Sanitise document IDs on import

Reported by: twagoo
Owned by: teckart
Priority: minor
Milestone: VLO-3.0
Component: VLO importer
Version:
Keywords:
Cc: twagoo

Description (last modified by twagoo)

A suggestion:

Currently, document IDs are taken literally from the metadata document or filename without further processing. This can lead to the inclusion of characters that are troublesome in URLs, e.g. [?&=%]. These characters could be escaped (and they are), but we recently encountered an issue (on LAT Trac) with double escaping as a result of URL rewriting. Issues of this kind are hard to predict, so it is best to keep things simple and strip out and/or encode all such characters in a URL-friendly way (which does not mean URL-encoding them!).

An example case is the request URL fragment ;jsessionid=9D091DE27DA0F95A0724A6C646C7E641?wicket:bookmarkablePage=:eu.clarin.cmdi.vlo.pages.ShowResultPage&q=Pe&fq=language:English&docId=hdl:1839/00-0000-0000-0008-3C6E-A@format%253Dcmdi

where the docId is hdl:1839/00-0000-0000-0008-3C6E-A@format=cmdi. MPI-PL by policy provides such handles for metadata documents that are converted to CMDI on the fly.
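
To illustrate the double escaping in the example above, here is a minimal sketch using only the standard `java.net.URLEncoder` and `java.net.URLDecoder` classes (Java 10 or later for the `Charset` overloads). The class name `DoubleEscapingDemo` and its helper methods are made up for this illustration and are not part of VLO. It shows that `=` escaped twice becomes `%253D`, and that a single decoding pass on the docId above leaves `%3D` rather than `=`.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the VLO code base.
final class DoubleEscapingDemo {

    // Percent-encode a value once.
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Percent-decode a value once.
    static String decode(String value) {
        return URLDecoder.decode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // '=' escaped once gives %3D; escaping that again turns '%' into %25.
        System.out.println(encode("="));          // %3D
        System.out.println(encode(encode("=")));  // %253D

        // The docId as it appears in the URL from the ticket.
        String fromUrl = "hdl:1839/00-0000-0000-0008-3C6E-A@format%253Dcmdi";
        String once = decode(fromUrl);
        String twice = decode(once);
        System.out.println(once);   // hdl:1839/00-0000-0000-0008-3C6E-A@format%3Dcmdi
        System.out.println(twice);  // hdl:1839/00-0000-0000-0008-3C6E-A@format=cmdi
    }
}
```

A consumer that decodes only once ends up with an ID that does not match the stored one, which is the kind of hard-to-predict failure the suggestion aims to avoid.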

Change History (4)

comment:1 Changed 10 years ago by DefaultCC Plugin

Cc: twagoo added

comment:2 Changed 10 years ago by twagoo

Description: modified (diff)

comment:3 Changed 10 years ago by teckart

Resolution: fixed
Status: new → closed

Fixed in branch vlo-3.0 (r4611: Fix ticket #490: "Sanitise document ID's on import" -> replacing problematic characters with their ASCII code in underscores)
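
For readers who want a picture of what "ASCII code in underscores" could look like, here is a rough sketch. It is not the code from r4611: the class name `DocIdSanitiser`, the choice of decimal character codes, and the set of characters treated as safe (ASCII letters, digits, `-`, `.` and `_`) are all assumptions made for illustration.

```java
// Hypothetical sketch; the actual change in r4611 may differ in detail.
final class DocIdSanitiser {

    // Replace every character outside the assumed safe set with
    // '_' + decimal character code + '_'.
    static String sanitise(String docId) {
        StringBuilder out = new StringBuilder(docId.length());
        for (int i = 0; i < docId.length(); i++) {
            char c = docId.charAt(i);
            if (isSafe(c)) {
                out.append(c);
            } else {
                // For ASCII input this is the ASCII code; for other
                // characters it is the UTF-16 code unit value.
                out.append('_').append((int) c).append('_');
            }
        }
        return out.toString();
    }

    // Assumed safe set: ASCII letters, digits, '-', '.', '_'.
    private static boolean isSafe(char c) {
        return (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_';
    }

    public static void main(String[] args) {
        System.out.println(sanitise("hdl:1839/00-0000-0000-0008-3C6E-A@format=cmdi"));
        // hdl_58_1839_47_00-0000-0000-0008-3C6E-A_64_format_61_cmdi
    }
}
```

Because `_` is left untouched in this sketch, the mapping is not strictly reversible (an ID that already contains `_58_` is indistinguishable from a sanitised `:`), so the sanitised value is only suitable as an identifier, not as a replacement for the original link.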

comment:4 Changed 10 years ago by Twan Goosen

This change caused #542. The following strikes me as the best solution:

(...) add a field 'selfLink' which holds the original, actual self link (which is the source of the doc ID if it is present) and use this field to populate the JSON request to the aggregator.

Edit: this has been implemented - the MdSelfLink header now also gets mapped to a facet "_selfLink", see #542 for details

Last edited 10 years ago by Twan Goosen