Opened 9 years ago

Closed 9 years ago

#749 closed defect (fixed)

'olac-language' attribute gets ignored

Reported by: Twan Goosen
Owned by: teckart@informatik.uni-leipzig.de
Priority: major
Milestone: VLO-3.2
Component: VLO importer
Version:
Keywords:
Cc: Twan Goosen

Description

The resources in the ELRA collection (OLAC based) use the following to
represent Spanish language content:

    <language olac-language="spa">Spanish, Castilian</language>

The language code (spa) is standard ISO 639, but the language name differs
from the VLO's mapped name, which is 'Spanish; Castilian' (semicolon instead
of a comma).

The problem is that in the 'language' facet, these records show up by language name, not code. Should the presence of 'olac-language' not overrule the language name in the mapping?

To see examples, go to search?fq=languageCode:name:Spanish,+Castilian and select a record. For example oai_catalogue_elra_info_ELRA_S0156 has

Notice there's no link for Spanish, as the language code has not been picked up or inferred by the importer.

Change History (10)

comment:1 Changed 9 years ago by DefaultCC Plugin

Cc: Twan Goosen added

comment:2 Changed 9 years ago by teckart@informatik.uni-leipzig.de

Owner: set to teckart@informatik.uni-leipzig.de
Status: new → assigned

comment:3 Changed 9 years ago by Twan Goosen

Related to #412?

comment:4 Changed 9 years ago by teckart@informatik.uni-leipzig.de

This is not directly related to #412.
The problem lies in the extraction process which prefers elements identified by concept links (here: content of the element language) to elements identified by XPath patterns (here: content of the attribute olac-language). Pattern-based content is only extracted when no conceptlink-based content was found.
Later the language postprocessor tries to map language names to language codes (like "German" -> "code:deu") but fails for "Spanish, Castilian" as the mapping file only contains an entry for "Spanish; Castilian".
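
The behaviour described above can be illustrated with a minimal Python sketch. It is not the importer's actual code: the function names, the `NAME_TO_CODE` table, and the single-element record are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the importer's name-to-code mapping file.
NAME_TO_CODE = {"Spanish; Castilian": "spa", "German": "deu"}
KNOWN_CODES = set(NAME_TO_CODE.values())


def extract_language(record_xml):
    """Prefer concept-link content (the element text); use the pattern-based
    content (the olac-language attribute) only when no text was found."""
    element = ET.fromstring(record_xml)
    text = (element.text or "").strip()
    if text:
        return [text]
    code = element.get("olac-language")
    return [code] if code else []


def postprocess(value):
    """Turn a raw value into 'code:xxx' if possible, otherwise 'name:...'."""
    if value in KNOWN_CODES:
        return "code:" + value
    code = NAME_TO_CODE.get(value)
    return "code:" + code if code else "name:" + value


record = '<language olac-language="spa">Spanish, Castilian</language>'
facet_values = [postprocess(v) for v in extract_language(record)]
print(facet_values)  # ['name:Spanish, Castilian']
```

Because the element text is found first, the attribute value `spa` is never consulted, and the comma variant of the name misses the mapping entry.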

I don't see a "great" solution for this problem. I could add a new step to the postprocessor which replaces an input value with the language code if a version of it with commas replaced by semicolons is in the mapping file.
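
A rough sketch of such a fallback step, in Python for brevity. The mapping table and the function name `to_language_code` are hypothetical and only illustrate the idea of retrying the lookup with commas swapped for semicolons.

```python
from typing import Optional

# Hypothetical excerpt of the name-to-code mapping file.
NAME_TO_CODE = {"Spanish; Castilian": "spa", "German": "deu"}


def to_language_code(name: str) -> Optional[str]:
    """Look up the name as given; if that fails, retry with every comma
    replaced by a semicolon."""
    code = NAME_TO_CODE.get(name)
    if code is None and "," in name:
        code = NAME_TO_CODE.get(name.replace(",", ";"))
    return code


print(to_language_code("Spanish, Castilian"))  # spa
```

Names that differ from the mapping file in other ways would still fall through and return `None`.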

comment:5 Changed 9 years ago by Twan Goosen

Ok, I see, so it's actually a mapping/definition problem. So the 'olac-language' attribute does not have a datacategory associated with it. If it did, and this datcat somehow had a higher priority than the category associated with the language name, it would be picked up instead (or in addition?), is that correct?

It would be nice if we could solve this in the definitions and not with a 'hack' (or at least additional post-hoc curation), but someone would need to decide on the changing of the datacategories. If you agree (and my assumptions are correct) we can ask Dieter to have a look at this.

comment:6 in reply to: 5 Changed 9 years ago by teckart@informatik.uni-leipzig.de

Replying to twan.goosen@…:

Ok, I see, so it's actually a mapping/definition problem. So the 'olac-language' attribute does not have a datacategory associated with it. If it did, and this datcat somehow had a higher priority than the category associated with the language name, it would be picked up instead (or in addition?), is that correct?

There are no priorities among datcats. Therefore it would be picked up in addition which leads to language entries "Spanish; Castilian" and "Spanish, Castilian" for the record.
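
To make the duplicate-entry effect concrete, here is a small Python sketch. It assumes a hypothetical `CODE_TO_NAME` display table and a `facet_entries` helper, neither of which is taken from the importer.

```python
# Hypothetical display names for language codes.
CODE_TO_NAME = {"spa": "Spanish; Castilian"}


def facet_entries(element_text, olac_code):
    """Collect values from both sources with equal priority and return the
    labels that would be shown for the record, without exact duplicates."""
    entries = []
    for value in (olac_code, element_text):
        label = CODE_TO_NAME.get(value, value)  # codes resolve to mapped names
        if label not in entries:
            entries.append(label)
    return entries


print(facet_entries("Spanish, Castilian", "spa"))
# ['Spanish; Castilian', 'Spanish, Castilian']
```

Only when the element text matches the mapped name exactly do the two values collapse into a single entry.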

It would be nice if we could solve this in the definitions and not with a 'hack' (or at least additional post-hoc curation), but someone would need to decide on the changing of the datacategories. If you agree (and my assumptions are correct) we can ask Dieter to have a look at this.

I would be very careful with modifying existing profiles afterwards (especially for a quite frequently used profile like this one) but I don't see a better solution. We would still need to find a way of dealing with "contradictory" strings.

comment:7 in reply to: 6 Changed 9 years ago by Twan Goosen

Replying to teckart@…:

There are no priorities among datcats. Therefore it would be picked up in addition which leads to language entries "Spanish; Castilian" and "Spanish, Castilian" for the record.
(..)
I would be very careful with modifying existing profiles afterwards (especially for a quite frequently used profile like this one) but I don't see a better solution. We would still need to find a way of dealing with "contradictory" strings.

We don't want to change the paradigm and redundant entries don't help improve the situation much. Maybe this should be treated like 'just another' automated curation case? So, as for organisations, we map language names to language names and/or codes, in this case "name:Spanish, Castilian" -> "code:spa" (or, before the language code postprocessing, "Spanish, Castilian" -> "spa"?).

P.S. There must be more cases that call for automated curation, such as the extinct language varieties with years such as "English, Middle (1100-1500)".
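
The curation idea could look roughly like the following Python sketch, applied before the language code postprocessing. The `VARIANT_TO_CODE` table, its single entry, and the function names are assumptions for illustration; the real importer may organise this differently.

```python
# Hypothetical curation table: known name variants mapped directly to codes.
VARIANT_TO_CODE = {"Spanish, Castilian": "spa"}

# Hypothetical excerpt of the regular name-to-code mapping file.
NAME_TO_CODE = {"Spanish; Castilian": "spa", "German": "deu"}
KNOWN_CODES = set(NAME_TO_CODE.values())


def curate(value):
    """Replace a known variant with its code; pass other values through."""
    return VARIANT_TO_CODE.get(value.strip(), value)


def postprocess(value):
    """Curate first, then apply the usual name-to-code mapping."""
    value = curate(value)
    if value in KNOWN_CODES:
        return "code:" + value
    code = NAME_TO_CODE.get(value)
    return "code:" + code if code else "name:" + value


print(postprocess("Spanish, Castilian"))  # code:spa
```

Further variants, such as names carrying year ranges, would be handled by adding entries to the curation table rather than by changing the mapping logic.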

comment:8 in reply to: 7 Changed 9 years ago by teckart@informatik.uni-leipzig.de

We don't want to change the paradigm and redundant entries don't help improve the situation much. Maybe this should be treated like 'just another' automated curation case? So, as for organisations, we map language names to language names and/or codes, in this case "name:Spanish, Castilian" -> "code:spa" (or, before the language code postprocessing, "Spanish, Castilian" -> "spa"?).

Sounds good to me. As there is (AFAIK) no (external) curation procedure yet, I could implement it in the importer comparable to the organisation names. Objections?

comment:9 in reply to:  8 Changed 9 years ago by Twan Goosen

Replying to teckart@…:

Sounds good to me. As there is (AFAIK) no (external) curation procedure yet, I could implement it in the importer comparable to the organisation names. Objections?

No, sounds good to me too :)

comment:10 Changed 9 years ago by teckart@informatik.uni-leipzig.de

Resolution: fixed
Status: assigned → closed

Solved in trunk (r6141 & r6142)
