Opened 12 years ago
Closed 9 years ago
#170 closed enhancement (fixed)
non-English language names should be ignored
Reported by: | dietuyt | Owned by: | teckart@informatik.uni-leipzig.de |
---|---|---|---|
Priority: | major | Milestone: | VLO-3.2 |
Component: | VLO importer | Version: | |
Keywords: | | Cc: | |
Description
Some CMDI files contain both a language name and a language code, e.g.:
    <LanguageName xml:lang="de">Deutsch</LanguageName>
    <ISO639>
      <iso-639-3-code>deu</iso-639-3-code>
    </ISO639>
This leads to a duplicate facet value: both "German" and "Deutsch" appear for the same language.
When the xml:lang attribute has a value other than "en" or "eng", the LanguageName should be ignored on import.
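The proposed rule could be sketched in Python roughly as follows. This only illustrates the filtering idea and is not the VLO importer's code, which this ticket does not show. The helper name `kept_language_names` and the wrapping `<Language>` element are assumptions made for the sketch, and it assumes elements without a default namespace, as in the snippet above. Names that carry no xml:lang attribute are kept here, since the rule only mentions values that differ from "en" or "eng".

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this namespace by the XML specification.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ENGLISH = {"en", "eng"}


def kept_language_names(root):
    """Return LanguageName values that survive the proposed rule (hypothetical helper)."""
    kept = []
    for element in root.iter("LanguageName"):
        lang = element.get(XML_LANG)
        # Drop the name only when xml:lang is present and is not English.
        if lang is not None and lang.lower() not in ENGLISH:
            continue
        if element.text and element.text.strip():
            kept.append(element.text.strip())
    return kept


# Wrapper element added for the sketch so that the snippet has a single root.
SAMPLE = """
<Language>
  <LanguageName xml:lang="de">Deutsch</LanguageName>
  <LanguageName xml:lang="en">German</LanguageName>
  <ISO639><iso-639-3-code>deu</iso-639-3-code></ISO639>
</Language>
"""

print(kept_language_names(ET.fromstring(SAMPLE)))
```

With this rule the "Deutsch" entry is dropped and only the English name remains as a facet candidate.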
Change History (13)
comment:1 Changed 12 years ago by
comment:2 Changed 12 years ago by
Priority: | major → minor |
---|
A hard problem that does not affect many records.
comment:3 Changed 12 years ago by
Resolution: | → duplicate |
---|---|
Status: | new → closed |
comment:4 Changed 11 years ago by
Resolution: | duplicate |
---|---|
Status: | closed → reopened |
Of which ticket is this a duplicate?
There is an additional problem in CMDI files where xml:lang is not specified, e.g.:
    <Language>
      <LanguageName>Nederlands</LanguageName>
      <ISO639-3>nld</ISO639-3>
      <Comment>NeDia</Comment>
    </Language>
I see several possible solutions:
- ignore LanguageName completely when one code and one name are available
- have a large mapping list with known language names that map to the canonical English name, e.g. Nederlands > Dutch, Deutsch > German (could be part of the import or of a curation module that runs beforehand)
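The mapping-list option might look something like the sketch below. The dictionary holds only the two pairs mentioned above and stands in for the much larger list a real import or curation step would need. The names `KNOWN_LANGUAGE_NAMES` and `canonical_language_name` are hypothetical.

```python
# Illustrative excerpt only; a real list would be much larger.
KNOWN_LANGUAGE_NAMES = {
    "nederlands": "Dutch",
    "deutsch": "German",
}


def canonical_language_name(name):
    """Map a known non-English name to its English name; pass others through."""
    cleaned = name.strip()
    # casefold() makes the lookup insensitive to capitalisation.
    return KNOWN_LANGUAGE_NAMES.get(cleaned.casefold(), cleaned)


for raw in ("Nederlands", "Deutsch", "Dutch"):
    print(raw, ">", canonical_language_name(raw))
```

Unknown names pass through unchanged, so such a list could grow incrementally without blocking records it does not yet cover.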
comment:5 Changed 11 years ago by
I agree about ignoring LanguageName. Shouldn't such a mapping be to the language code instead of to common or canonical names? No language identifier is as canonical as an ISO 639-3 code. I would say only an ISO 639-3 code is needed. Any further descriptive information can be sourced from the SIL code table, and that data even changes from time to time.
comment:6 Changed 11 years ago by
For certain languages (e.g. many of the ones studied at the MPI) no ISO-639-3 identifier exists. So we need a fallback in case no code is given, but for most of the CMDI records there is such a code, and in that case the language name can be ignored.
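A minimal sketch of this "code first, name as fallback" behaviour is given below. It assumes a lookup table from ISO 639-3 codes to English names; the two entries shown are only an excerpt for illustration, and the function name `language_facet_value` is hypothetical, not taken from the VLO code base.

```python
# Tiny illustrative excerpt; a full table would be derived from the ISO 639-3 code tables.
ISO_639_3_NAMES = {"deu": "German", "nld": "Dutch"}


def language_facet_value(code, name):
    """Prefer the name derived from the ISO 639-3 code; fall back to the given name."""
    if code:
        normalized = code.strip().lower()
        if normalized in ISO_639_3_NAMES:
            return ISO_639_3_NAMES[normalized]
    # No code, or a code missing from the table: fall back to the name in the record.
    if name and name.strip():
        return name.strip()
    return None


print(language_facet_value("deu", "Deutsch"))
print(language_facet_value(None, "Example language without a code"))
```

When a known code is present the record's own language name is ignored, which avoids the double facet value; records without a code still get a facet value from their name.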
comment:7 Changed 11 years ago by
Priority: | minor → major |
---|
Boosting priority as this can be relatively easily fixed (for the case when xml:lang is there) and more and more records result in double language entries.
comment:8 Changed 10 years ago by
Component: | VLO web app → VLO importer |
---|
comment:9 Changed 10 years ago by
Owner: | changed from herste to Twan Goosen |
---|---|
Status: | reopened → assigned |
comment:11 Changed 10 years ago by
Owner: | changed from Twan Goosen to teckart@informatik.uni-leipzig.de |
---|
comment:12 Changed 10 years ago by
Milestone: | → VLO-3.1 |
---|
comment:13 Changed 9 years ago by
Milestone: | VLO-3.1 → VLO-3.2 |
---|
Splitting the 3.1 milestone. Most open tickets go to 3.2 so that we can have a release in the short term.
comment:14 Changed 9 years ago by
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Fixed in trunk (r6150)
Note: in the given example the XSD is neatly annotated with the ISOcat data category, resulting in multiple XPaths. This needs some experimentation; the situation may differ for different instances or components.
To clarify: as these are automatically generated XPaths, the solution is not completely trivial (if we don't want to break a lot of stuff).