Opened 12 years ago

Closed 9 years ago

#170 closed enhancement (fixed)

non-English language names should be ignored

Reported by: dietuyt
Owned by: teckart@informatik.uni-leipzig.de
Priority: major
Milestone: VLO-3.2
Component: VLO importer
Version:
Keywords:
Cc:

Description

Some CMDI files contain both a language name and a language code, e.g.:

<LanguageName xml:lang="de">Deutsch</LanguageName>
<ISO639>
  <iso-639-3-code>deu</iso-639-3-code>
</ISO639>

This leads to a duplicate facet value ("German" and "Deutsch"), as in:

http://catalog.clarin.eu/ds/vlo/?wicket:bookmarkablePage=:eu.clarin.cmdi.vlo.pages.ShowResultPage&fq=dataProvider:CMDI+Providers&fq=language:Englisch&docId=http://hdl.handle.net/11858/00-1778-0000-0005-896C-F

When the xml:lang attribute has a value other than "en" or "eng", the LanguageName should be ignored on import.
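
A minimal sketch of that rule, in Python rather than the importer's own code. It assumes un-namespaced element names as in the snippet above (real CMDI records are namespaced and the importer works with generated XPaths), and it assumes that a LanguageName without any xml:lang attribute is kept. The function name `english_language_names` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ENGLISH_TAGS = {"en", "eng"}


def english_language_names(root):
    """Collect LanguageName values, skipping those tagged as non-English.

    Assumption: elements without xml:lang are kept, since the rule in
    this ticket only covers the case where the attribute is present.
    """
    names = []
    for elem in root.iter("LanguageName"):
        lang = elem.get(XML_LANG)
        if lang is not None and lang.strip().lower() not in ENGLISH_TAGS:
            continue  # e.g. xml:lang="de": ignore this name
        if elem.text and elem.text.strip():
            names.append(elem.text.strip())
    return names


sample = ET.fromstring(
    "<Language>"
    '<LanguageName xml:lang="de">Deutsch</LanguageName>'
    '<LanguageName xml:lang="en">German</LanguageName>'
    "<ISO639><iso-639-3-code>deu</iso-639-3-code></ISO639>"
    "</Language>"
)
print(english_language_names(sample))  # ['German']
```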

Change History (13)

comment:1 Changed 12 years ago by herste

Note that in the given example the XSD is neatly annotated with the ISOcat data category, resulting in multiple XPaths. This needs some experimentation; the situation may differ for different instances or components.

To clarify: as these are automatically generated XPaths, the solution is not completely trivial (if we don't want to break a lot of stuff).

comment:2 Changed 12 years ago by dietuyt

Priority: major → minor

Hard problem that does not influence too many records.

comment:3 Changed 12 years ago by teckart

Resolution: duplicate
Status: new → closed

comment:4 Changed 11 years ago by dietuyt

Resolution: duplicate
Status: closed → reopened

Of which ticket is this a duplicate?

There is an additional problem in CMDI files where xml:lang is not specified, e.g.:

<Language>
<LanguageName>Nederlands</LanguageName>
<ISO639-3>nld</ISO639-3>
<Comment>NeDia</Comment>
</Language>

I see several possible solutions:

  • ignore LanguageName completely when one code and one name are available
  • have a large mapping list with known language names that map to the canonical English name, e.g. Nederlands > Dutch, Deutsch > German (could be part of the import or of a curation module that runs beforehand)
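
The second option could look roughly like the sketch below. The table holds only the two pairs named above; a real list would be much larger, and the names `NAME_TO_ENGLISH` and `canonical_name` are hypothetical, not part of the VLO code base. Unknown names are passed through unchanged, which is an assumption about the desired behaviour.

```python
# Hypothetical mapping from known non-English names to canonical English names.
# Keys are lower-cased so that lookups are case-insensitive.
NAME_TO_ENGLISH = {
    "nederlands": "Dutch",
    "deutsch": "German",
}


def canonical_name(raw_name):
    """Return the canonical English name, or the cleaned input if unmapped."""
    cleaned = raw_name.strip()
    return NAME_TO_ENGLISH.get(cleaned.lower(), cleaned)


print(canonical_name("Nederlands"))  # Dutch
print(canonical_name("Deutsch"))     # German
```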


comment:5 Changed 11 years ago by sanmai

I agree about ignoring LanguageName. Shouldn't such a mapping be to the language code instead of to common or canonical names? No language identifier is as canonical as an ISO 639-3 code. I would say there is only a need for an ISO 639-3 code. Any further descriptive data can be sourced from the SIL code table, and that data even changes from time to time.

comment:6 Changed 11 years ago by dietuyt

For certain languages (e.g. many of the ones studied at the MPI) no ISO-639-3 identifier exists. So we need a fallback in case no code is given, but for most of the CMDI records there is such a code, and in that case the language name can be ignored.
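
A rough sketch of that code-first, name-as-fallback logic. The two-entry `CODE_TO_ENGLISH` table stands in for a full ISO 639-3 table, and `facet_value` is a hypothetical helper, not the importer's actual API. Using the name when the code is not in the table is also an assumption.

```python
# Stand-in for a full ISO 639-3 code table (only the codes from this ticket).
CODE_TO_ENGLISH = {
    "deu": "German",
    "nld": "Dutch",
}


def facet_value(code, name):
    """Prefer the name derived from the code; fall back to the given name."""
    if code:
        known = CODE_TO_ENGLISH.get(code.strip().lower())
        if known:
            return known  # code wins, the supplied name is ignored
    if name and name.strip():
        return name.strip()  # no usable code: keep the supplied name
    return None  # neither code nor name available


print(facet_value("deu", "Deutsch"))    # German
print(facet_value(None, "Nederlands"))  # Nederlands
```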

comment:7 Changed 11 years ago by dietuyt

Priority: minor → major

Boosting priority as this can be relatively easily fixed (for the case when xml:lang is there) and more and more records result in double language entries.

comment:8 Changed 10 years ago by twagoo

Component: VLO web app → VLO importer

comment:9 Changed 10 years ago by latadmin@mpi.nl

Owner: changed from herste to Twan Goosen
Status: reopened → assigned

comment:11 Changed 10 years ago by Twan Goosen

Owner: changed from Twan Goosen to teckart@informatik.uni-leipzig.de

comment:12 Changed 10 years ago by Twan Goosen

Milestone: VLO-3.1

comment:13 Changed 9 years ago by Twan Goosen

Milestone: VLO-3.1 → VLO-3.2

Splitting the 3.1 milestone. Most open tickets go to 3.2 so that we can have a release in the short term.

comment:14 Changed 9 years ago by teckart@informatik.uni-leipzig.de

Resolution: fixed
Status: assigned → closed

Fixed in trunk (r6150)
