Opened 10 years ago
Closed 9 years ago
#554 closed defect (fixed)
Missing language information for some files
Reported by: | teckart@informatik.uni-leipzig.de | Owned by: | teckart |
---|---|---|---|
Priority: | major | Milestone: | VLO-3.1 |
Component: | VLO importer | Version: | |
Keywords: | | Cc: | Twan Goosen |
Description
Apparently the new beta version of the VLO does not contain language information for some files ([2]), whereas the same record in the production VLO does contain the correct value for this facet ([1]: "kil", "Kariya"). Neither the importer process, the source OLAC record ([3]), nor the OLAC2CMDI XSLT script ([4]) seems to have changed in recent weeks ([5]).
[1] http://catalog.clarin.eu/vlo/record?0&fq=language:Kariya&docId=Other/oai_ethnologue_com_kil.xml
[2] http://catalog-clarin.esc.rzg.mpg.de/vlo-solr/core0/select?q=*%3A*&fq=id%3A+Others%2Foai_ethnologue_com_kil.xml&wt=json&indent=true
[3] http://catalog.clarin.eu/oai-harvester/others/oai-pmh/Ethnologue_Languages_of_the_World/oai_ethnologue_com_kil.xml (source), http://catalog.clarin.eu/oai-harvester/others/results/olac/Ethnologue_Languages_of_the_World/oai_ethnologue_com_kil.xml (OLAC extraction)
[4] https://github.com/TheLanguageArchive/oai-harvest-manager/blob/master/src/main/resources/olac2cmdi.xsl
[5] http://catalog.clarin.eu/oai-harvester/resultsets/backups/
Change History (13)
comment:1 Changed 10 years ago by
Cc: | Twan Goosen added |
---|
comment:2 Changed 10 years ago by
Milestone: | → VLO-3.0 |
---|
comment:3 follow-up: 4 Changed 10 years ago by
comment:4 Changed 10 years ago by
Replying to j.knappen@…:
For that particular resource, the language is not tagged in the Metadata. It seems that VLO 2.18 somehow extracts the language information from the title, the subject, or the SIL identifier.
Looks like a case for metadata curation to me.
The language code is in there actually:
    ...
    <OLAC-DcmiTerms>
      ...
      <publisher>SIL International (www.sil.org)</publisher>
      <subject olac-language="kil"/>
      <title>Kariya: a language of Nigeria</title>
      ...
It matches the pattern
/c:CMD/c:Components//c:OLAC-DcmiTerms/c:subject/@olac-language
defined in facetConcepts.xml.
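To make the match concrete, here is a minimal Python sketch that evaluates the relative part of that pattern against a cut-down record. The record below is a hypothetical reduction of the snippet above, the namespace URI is assumed to be the CMDI 1.1 one, and the function name is made up for illustration. This is not how the VLO importer itself evaluates patterns.

```python
import xml.etree.ElementTree as ET

# Assumption: the "c" prefix is bound to the CMDI 1.1 namespace.
CMD_NS = {"c": "http://www.clarin.eu/cmd/"}

# Hypothetical, heavily reduced version of the Ethnologue record.
RECORD = """<CMD xmlns="http://www.clarin.eu/cmd/">
  <Components>
    <OLAC-DcmiTerms>
      <publisher>SIL International (www.sil.org)</publisher>
      <subject olac-language="kil"/>
      <title>Kariya: a language of Nigeria</title>
    </OLAC-DcmiTerms>
  </Components>
</CMD>"""


def olac_language_codes(xml_text):
    """Collect non-empty olac-language attributes of matching subject elements."""
    root = ET.fromstring(xml_text)  # root is the CMD element
    # ElementTree cannot select attributes in a path, so the element is
    # located first and the attribute is read afterwards.
    subjects = root.findall("c:Components//c:OLAC-DcmiTerms/c:subject", CMD_NS)
    return [s.get("olac-language") for s in subjects if s.get("olac-language")]


codes = olac_language_codes(RECORD)
print(codes)
```

For the reduced record this yields the single code "kil", which is the value shown in the production VLO.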
comment:5 follow-up: 6 Changed 10 years ago by
That is true, but the normal behaviour of the importer is to ignore these hard-coded patterns if the CMDI profile contains at least one element with a matching concept link. This is the case here ("OLAC-DcmiTerms/language", although it is empty). Therefore this attribute is normally not used to fill the language facet for this profile.
I agree with Jörg here that it would be a cleaner way to use the element that was actually supposed to store the language information instead of using some underspecified attribute. Nonetheless it leaves the question how the language information was extracted in the production VLO.
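The precedence rule described in this comment can be sketched roughly as follows. This is an illustration only, not the importer's code: the function names, the XPath for the language element, and the dictionary that stands in for XPath evaluation are all assumptions.

```python
LANGUAGE_CONCEPT = "http://purl.org/dc/terms/language"
PATTERN = "/c:CMD/c:Components//c:OLAC-DcmiTerms/c:subject/@olac-language"
# Hypothetical XPath for the (empty) language element of the profile.
ELEMENT_XPATH = "/c:CMD/c:Components/c:OLAC-DcmiTerms/c:language/text()"


def select_xpaths(profile_links, facet_concepts, patterns):
    """Concept-link matches win; patterns are used only if nothing matches."""
    matched = [xpath for concept, xpath in profile_links if concept in facet_concepts]
    return matched if matched else list(patterns)


# The profile links the language element to the DC language concept.
profile_links = [(LANGUAGE_CONCEPT, ELEMENT_XPATH)]
# Stand-in for XPath evaluation on the record: element empty, attribute set.
record_values = {ELEMENT_XPATH: [], PATTERN: ["kil"]}


def facet_values(facet_concepts):
    xpaths = select_xpaths(profile_links, facet_concepts, [PATTERN])
    return [value for xpath in xpaths for value in record_values[xpath]]


with_concept = facet_values({LANGUAGE_CONCEPT})  # concept is part of the facet
without_concept = facet_values(set())            # concept is not part of the facet
print(with_concept, without_concept)
```

With the concept in the facet definition the empty element shadows the pattern and no value is found; without it the pattern is used and "kil" is returned.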
comment:6 Changed 10 years ago by
Replying to teckart@…:
I agree with Jörg here that it would be a cleaner way to use the element that was actually supposed to store the language information instead of using some underspecified attribute. Nonetheless it leaves the question how the language information was extracted in the production VLO.
The reason seems to be that the overriding (if I understand this correctly) concept link was added in the 3.0 branch in [4575] and this change was never applied to 2.x. Maybe the mapping for 3.0 needs to be reconsidered?
comment:7 Changed 10 years ago by
That's it! I think we should leave the facet definition as it is (as http://purl.org/dc/terms/language is a correct DC for this facet) and try to convince the creator to include values in the language element in the future. In my opinion the fixed patterns should just be seen as a short-term workaround, and I don't think it is a good idea to optimize the VLO results for them.
comment:8 Changed 10 years ago by
The direct creator of these CMDI records is the harvester, so maybe something can be done there, i.e. fill the 'language' field with the value from the attribute. Additionally or alternatively, we could assign the data category to the 'olac-language' attribute in the profile specification as well. But maybe there are arguments against doing that?
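The first option would be implemented in the harvester's XSLT; the Python sketch below only illustrates the intended transformation on an already converted record. The namespace URI, the reduced record, and the helper names are assumptions, and using the first code found is an arbitrary choice made for this example.

```python
import xml.etree.ElementTree as ET

NS = "http://www.clarin.eu/cmd/"  # assumed CMDI 1.1 namespace

# Hypothetical reduced record with an empty language element.
RECORD = f"""<CMD xmlns="{NS}">
  <Components>
    <OLAC-DcmiTerms>
      <language/>
      <subject olac-language="kil"/>
      <title>Kariya: a language of Nigeria</title>
    </OLAC-DcmiTerms>
  </Components>
</CMD>"""


def q(tag):
    """Return the namespace-qualified tag name used by ElementTree."""
    return f"{{{NS}}}{tag}"


def fill_language(root):
    """Copy the olac-language code into empty language elements; return the count."""
    filled = 0
    for terms in root.iter(q("OLAC-DcmiTerms")):
        codes = [s.get("olac-language") for s in terms.findall(q("subject"))
                 if s.get("olac-language")]
        if not codes:
            continue
        for lang in terms.findall(q("language")):
            if not (lang.text or "").strip():
                lang.text = codes[0]  # arbitrary choice: first code wins
                filled += 1
    return filled


root = ET.fromstring(RECORD)
filled = fill_language(root)
language_text = root.find(".//" + q("language")).text
print(filled, language_text)
```

Language elements that already carry a value are left untouched, so the change would only affect records like the Ethnologue ones.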
comment:9 Changed 10 years ago by
I would prefer the first solution (should be only a minor change in the XSLT). We could also assign a concept link to the attribute, but as there is already a "correct" element in the profile this is probably not necessary...
comment:10 Changed 10 years ago by
Milestone: | VLO-3.0 |
---|
Taking this off the 3.0 milestone, issue has been resolved (or mitigated) by removing the "http://purl.org/dc/terms/language" data category from the facet mapping (see #412, r5192, r5195).
comment:11 follow-up: 12 Changed 10 years ago by
Closing this; followed up by MPI #4271.
Menzo has looked into an alternative mapping strategy where there is a fallback to patterns in case of no matches from the data categories; this might improve the situation, too.
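One possible reading of such a fallback strategy, sketched in Python: the paths derived from data categories are tried first, and the patterns are consulted only if those paths produce no value. Whether "no matches" refers to missing paths or to missing values is not stated here, so the second reading is an assumption, as are all names below.

```python
def extract_with_fallback(concept_xpaths, pattern_xpaths, evaluate):
    """Use concept-derived paths first; fall back to patterns if they yield nothing."""
    values = [v for xpath in concept_xpaths for v in evaluate(xpath)]
    if values:
        return values
    return [v for xpath in pattern_xpaths for v in evaluate(xpath)]


# Stand-in for XPath evaluation on one record (hypothetical keys).
record_values = {"language-element": [], "olac-language-attribute": ["kil"]}

result = extract_with_fallback(
    ["language-element"],
    ["olac-language-attribute"],
    lambda xpath: record_values.get(xpath, []),
)
print(result)
```

Under this reading the empty language element no longer hides the attribute, while records with a filled language element keep using the element value.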
comment:12 Changed 10 years ago by
Replying to twan.goosen@…:
Closing this; followed up by MPI #4271.
Menzo has looked into an alternative mapping strategy where there is a fallback to patterns in case of no matches from the data categories; this might improve the situation, too.
See #668
comment:13 Changed 9 years ago by
Milestone: | → VLO-3.1 |
---|---|
Resolution: | → fixed |
Status: | new → closed |
Concept "http://purl.org/dc/terms/language" was added to the language facets again (r5990); the problematic Ethnologue files still work because of the modified extraction mechanism (r5985).