Opened 11 years ago

Closed 9 years ago

Last modified 9 years ago

#412 closed enhancement (fixed)

Allow for plain <language> tagging

Reported by: knappen Owned by: teckart
Priority: minor Milestone: VLO-3.2
Component: VLO importer Version:
Keywords: Cc:

Description

Currently, languages plainly tagged with

<language>deu</language>

inside <OLAC-DcmiTerms> aren't used by the VLO. Instead, tagging in the style

<language olac-language="deu">deu</language>

is required for the language to be recognized.

This leads to poor recall in the language facet.

The VLO should make use of plain language tags.

Note also that the standard tool chain (olac2cmdi.xsl) creates plain language tagging, so metadata created this way remains unseen in the language facet.
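The requested behaviour can be sketched as follows. This is only an illustration in Python, not the VLO importer's code: the function name `language_values` and the sample record are hypothetical, and the sample omits the XML namespaces that real CMDI files carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free sample; real CMDI records declare namespaces.
SAMPLE = """<OLAC-DcmiTerms>
  <language>deu</language>
  <language olac-language="eus">Basque</language>
  <language>  </language>
</OLAC-DcmiTerms>"""


def language_values(xml_text):
    """Return one language value per <language> element, skipping empty ones."""
    root = ET.fromstring(xml_text)
    values = []
    for elem in root.iter("language"):
        # Prefer the controlled attribute; fall back to the plain element text.
        value = elem.get("olac-language") or (elem.text or "").strip()
        if value:
            values.append(value)
    return values


print(language_values(SAMPLE))
```

With this fallback, a plain `<language>deu</language>` contributes `deu` instead of being dropped.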

Change History (14)

comment:1 Changed 11 years ago by knappen

Component: changed from CenterRegistry to VLO
Owner: changed from BSanchezRZG to keeloo

comment:2 Changed 11 years ago by teckart

Could you add a link to an affected CMDI file?

comment:3 Changed 11 years ago by knappen

Yes, for example:

http://catalog.clarin.eu/vlo/;jsessionid=EA6747C6E0342B5CC95B3040C03FE5C0?wicket:bookmarkablePage=:eu.clarin.cmdi.vlo.pages.ShowResultPage&fq=resourceType:OralCorpus&docId=oai:ehu-upv:aholab:abiadura

The language Basque is tagged as <language>baq</language>, but does not show up in the language facet. (Consulting the standard, I see that baq is a synonym for the "more standard" eus, similar to the situation with ger/deu. But this should not be the problem; the VLO should be aware of the few synonyms in ISO 639 and resolve them to their expected values.)
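Resolving the synonyms could look roughly like the sketch below. ISO 639-2 defines twenty languages with both a bibliographic (B) and a terminologic (T) code; the dictionary lists those pairs. The names `B_TO_T` and `normalize_639_2` are hypothetical, and this is not how the VLO does it.

```python
# ISO 639-2 bibliographic (B) codes mapped to their terminologic (T) codes.
B_TO_T = {
    "alb": "sqi", "arm": "hye", "baq": "eus", "bur": "mya", "chi": "zho",
    "cze": "ces", "dut": "nld", "fre": "fra", "geo": "kat", "ger": "deu",
    "gre": "ell", "ice": "isl", "mac": "mkd", "mao": "mri", "may": "msa",
    "per": "fas", "rum": "ron", "slo": "slk", "tib": "bod", "wel": "cym",
}


def normalize_639_2(code):
    """Map a B code to its T code; any other value is returned trimmed and lowercased."""
    code = code.strip().lower()
    return B_TO_T.get(code, code)


print(normalize_639_2("baq"))
print(normalize_639_2("deu"))
```

Values that are not B codes pass through (only trimmed and lowercased), so the mapping does not validate anything by itself.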

comment:4 Changed 11 years ago by dietuyt

Priority: changed from major to minor
Type: changed from defect to enhancement

We had this before, but using the content of the language element without the controlled vocabulary resulted in too much noise (people really put in all kinds of values, including names of a language in multiple languages). IMHO this can only be fixed with a lot of checks, and it is mostly relevant for OLAC records; hence lowering the priority and changing the type from defect to enhancement.

comment:5 Changed 10 years ago by twagoo

Component: changed from VLO web app to VLO importer

comment:6 Changed 10 years ago by teckart

Resolution: fixed
Status: changed from new to closed

An evaluation showed that the quality of DCMI language values (as used in the OLAC-DcmiTerms profile) is better than expected. At least 94% of these values in CMDI files contain acceptable language entries. Therefore support for ISO 639-2/B language codes was implemented and "http://purl.org/dc/terms/language" was added as a concept to the language facet.

Preliminary solution (r4575) committed to branch vlo-3.0. Results will be evaluated in the beta phase. It will be removed if the result is worse than expected.

comment:7 Changed 10 years ago by oschonef

Just a quick thought: maybe the importer could also use the SIL ISO mapping table, to

  1. validate language codes
  2. "upgrade" two-letter codes to three-letter codes (thus, perform some normalization)

The tables are available from the SIL website at http://www-01.sil.org/iso639-3/download.asp
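Both steps could be sketched like this, using a tiny inline excerpt instead of the downloaded file. The column names (`Id`, `Part2B`, `Part2T`, `Part1`, `Ref_Name`) are an assumption about the table layout and should be checked against the actual download; `load_table` and `normalize` are hypothetical helper names.

```python
import csv
import io

# Inline excerpt in the assumed layout of the tab-separated code table.
ROWS = [
    ("Id", "Part2B", "Part2T", "Part1", "Ref_Name"),
    ("deu", "ger", "deu", "de", "German"),
    ("eus", "baq", "eus", "eu", "Basque"),
    ("spa", "spa", "spa", "es", "Spanish"),
    ("nds", "nds", "nds", "", "Low German"),
]
SAMPLE_TABLE = "\n".join("\t".join(row) for row in ROWS)


def load_table(text):
    """Return (set of three-letter ids, alias -> id mapping) from the table text."""
    ids, aliases = set(), {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        ids.add(row["Id"])
        # Two-letter and bibliographic codes both resolve to the three-letter id.
        for column in ("Part1", "Part2B"):
            if row[column]:
                aliases[row[column]] = row["Id"]
    return ids, aliases


def normalize(code, ids, aliases):
    """Validate a code and upgrade it to its three-letter id; None if unknown."""
    code = code.strip().lower()
    if code in ids:
        return code
    return aliases.get(code)


ids, aliases = load_table(SAMPLE_TABLE)
print(normalize("de", ids, aliases))
print(normalize("Spanish", ids, aliases))
```

A return value of `None` marks input that is not a known code, for example a free-text language name; what to do with such values is a separate decision.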

comment:8 Changed 10 years ago by teckart

The current solution already supports ISO 639-1 and 639-3 (based on the values in the respective CMDI components) and, with the last commit, ISO 639-2. The problem is that we also accept language names without further validation (because of valid names like "Spanish" instead of the "official" ISO name "Spanish; Castilian", or something like "Eastern Maroon Creole", which has no ISO code).

We could of course ignore all language information which is not a standard language code, but this would probably lead to missing information for many entries (has to be evaluated though).

comment:9 Changed 10 years ago by teckart@informatik.uni-leipzig.de

Resolution: fixed deleted
Status: changed from closed to reopened

Removed "http://purl.org/dc/terms/language" from the language facet again (r5192), as the VLO beta showed problems with OLAC-based profiles (#554). Including it again requires some changes to the facet mapping procedure (basically: including patterns by default, or postponing the decision on their use to the actual extraction).

comment:10 Changed 10 years ago by Twan Goosen

Merged to 3.0 branch in r5195

comment:11 Changed 10 years ago by Sander Maijers

Owner: changed from keeloo to keesjan.vandelooij@mpi.nl
Status: changed from reopened to assigned

comment:12 Changed 9 years ago by Twan Goosen

Owner: changed from keesjan.vandelooij@mpi.nl to teckart

Thomas, can you assess the status of this ticket and close it if it is no longer relevant? Thanks

comment:13 Changed 9 years ago by teckart@informatik.uni-leipzig.de

Resolution: fixed
Status: changed from assigned to closed

The concept "http://purl.org/dc/terms/language" was added to the language facet again (r5990). Problematic files in which only the attribute contains relevant content (addressed by a pattern in the facet definition file) still work because of the modified extraction mechanism (r5985), which postpones the decision on their use to extraction time (as mentioned in comment:9).

comment:14 Changed 9 years ago by Twan Goosen

Milestone: VLO-3.2