Opened 12 years ago

Closed 9 years ago

#246 closed enhancement (fixed)

exclude actor language from language facet

Reported by: dietuyt
Owned by: teckart@informatik.uni-leipzig.de
Priority: major
Milestone: VLO-3.3
Component: VLO importer
Version:
Keywords:
Cc: teckart, keeloo, mwindhouwer

Description

Right now, all elements linked to the data categories 2482 (language code) and 2484 (language name) are imported into the facet "language". As reported by Florian Schiel and Jan Odijk, this means that languages that do not appear in a given recording are nevertheless listed for it in the VLO.

To avoid this, a check should take place upon import to determine whether one of the ancestors of the element is linked to a container data category indicating that this language is *not* a relevant content language; if so, the element should be skipped. Currently there is only http://www.isocat.org/datcat/DC-4146 (container data category "actor"), but this list will be extended (e.g. for documentation language).
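A minimal sketch of such an ancestor check, not the actual importer code. It assumes each element carries its data category in a hypothetical `concept` attribute (in real CMDI the concept links live in the profile, not the instance), and it assumes that 2482 and 2484 follow the same URL scheme as DC-4146.

```python
import xml.etree.ElementTree as ET

# Assumed URLs: built by analogy with the DC-4146 link in this ticket.
LANGUAGE_CONCEPTS = {
    "http://www.isocat.org/datcat/DC-2482",  # language code
    "http://www.isocat.org/datcat/DC-2484",  # language name
}
EXCLUDED_CONTAINERS = {
    "http://www.isocat.org/datcat/DC-4146",  # container data category "actor"
}


def content_languages(root):
    """Return language values that have no excluded container among their ancestors."""
    found = []

    def walk(element, inside_excluded):
        concept = element.get("concept")  # hypothetical attribute, see lead-in
        if concept in EXCLUDED_CONTAINERS:
            inside_excluded = True
        if concept in LANGUAGE_CONCEPTS and not inside_excluded:
            found.append((element.text or "").strip())
        for child in element:
            walk(child, inside_excluded)

    walk(root, False)
    return found


# Toy record, invented for illustration only.
SAMPLE = """
<Session>
  <Content>
    <Language concept="http://www.isocat.org/datcat/DC-2484">Dutch</Language>
  </Content>
  <Actor concept="http://www.isocat.org/datcat/DC-4146">
    <Language concept="http://www.isocat.org/datcat/DC-2484">Latin</Language>
  </Actor>
</Session>
"""

print(content_languages(ET.fromstring(SAMPLE)))
```

Extending the exclusion list (e.g. for a documentation language container) would then only mean adding another entry to `EXCLUDED_CONTAINERS`.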

See also: example of such a CMDI, leading to the (incorrect) language facet value "Latin"

Change History (10)

comment:1 Changed 12 years ago by DefaultCC Plugin

Cc: teckart keeloo added

comment:2 Changed 12 years ago by dietuyt

some context - a related mail from Florian on this subject:

Hi Florian,

Thanks for reporting. I'll answer this mail in 2 parts, as different
people are involved for each issue. Here is part 1: container data
categories and the VLO.

On 30/11/12 09:44, Florian Schiel wrote:

> To all who can manipulate the CMD registry!
>
> We think that the component cmdi-actorlanguage is ill-defined in the
> registry:
>
> (see
> http://catalog.clarin.eu/ds/ComponentRegistry?item=clarin.eu:cr1:c_1271859438147)
>
> The component itself points to isocat 4146, which is not a definition of
> language but the definition of 'actor'.

Yes, it's not the most elegant solution. But I changed it this way
because you reported earlier that the value of actor.cmdi-language
ended up in the VLO facet "Language", as explained in

https://trac.clarin.eu/ticket/246

So we need to be able to distinguish here between content language and
actor language. I see 3 options:

  • We exclude actor.language (by detecting the combination of the container datcat for actor + the general language datcat). That is what the ticket above says. The only way to make sure that this is excluded is to have both container datcat and language datcat in the same component (relying on a higher-located actor somewhere does not work if someone uses this component without embedding it in an actor).
    (backward compatible)

  • We make some copies of the iso-language-639-3 component (including its ± 7000 language codes) and give each a more detailed data category (content language, actor language, etc.).
    (maybe backward compatible if we give it the same name, which would be confusing)

  • We create more container datcats (content, documentation, ...) and create new components that are purely a wrapper around iso-language-639-3, so that we have a direct

    [actor container DC].[language ID DC]
    [content container DC].[language ID DC]
    etc.

    (not backward compatible)

What are the opinions on these approaches?

comment:3 Changed 11 years ago by herold

Considering the CMDI example in the initial report: Wouldn't you want to look for SubjectLanguage instead of ActorLanguage anyway?

comment:4 Changed 11 years ago by dietuyt

Cc: mwindhouwer added

This is addressed by adding the necessary container data categories (overview) and by Menzo's recent addition to the VLO ingester, which takes care of the container data category. To be tested soon.

comment:5 Changed 11 years ago by dietuyt

Milestone: VLO-2.16

comment:6 Changed 11 years ago by twagoo

Component: changed from VLO web app to VLO importer
Milestone: VLO-2.16

comment:7 Changed 10 years ago by latadmin@mpi.nl

Owner: changed from herste to Twan Goosen
Status: changed from new to assigned

comment:8 Changed 9 years ago by teckart@informatik.uni-leipzig.de

Owner: changed from Twan Goosen to teckart@informatik.uni-leipzig.de

comment:9 Changed 9 years ago by Twan Goosen

Milestone: VLO-3.3

Moved a number of existing VLO tickets to 3.3 milestone

comment:10 Changed 9 years ago by teckart@informatik.uni-leipzig.de

Resolution: fixed
Status: changed from assigned to closed

Re-enabled context blacklisting, using concept DC-4146 as an illegal context (r6318). This has no direct impact on extraction, as the problematic elements were already blacklisted by simple blacklist patterns, but it improves the readability of the facetconcept file. Very likely, more blacklisting rules will have to be added in the future for other facets.
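To illustrate the difference between the two rule styles mentioned here, a rough sketch under stated assumptions: the paths, the `*/Actor/*` pattern, and the data layout are invented for the example and are not taken from the facetconcept file or the importer.

```python
from fnmatch import fnmatchcase

ILLEGAL_CONTEXTS = {"DC-4146"}       # container data category "actor"
BLACKLIST_PATTERNS = ["*/Actor/*"]   # hypothetical simple path pattern

# Toy candidates: (path of a language element, concepts linked to its ancestors).
CANDIDATES = [
    ("/CMD/Session/Content/Language", []),
    ("/CMD/Session/Actor/Language", ["DC-4146"]),
]


def kept_by_patterns(candidates, patterns):
    """Keep paths that match none of the blacklist patterns."""
    return [path for path, _ in candidates
            if not any(fnmatchcase(path, p) for p in patterns)]


def kept_by_context(candidates, illegal_contexts):
    """Keep paths none of whose ancestors is linked to an illegal context concept."""
    return [path for path, ancestors in candidates
            if illegal_contexts.isdisjoint(ancestors)]


by_pattern = kept_by_patterns(CANDIDATES, BLACKLIST_PATTERNS)
by_context = kept_by_context(CANDIDATES, ILLEGAL_CONTEXTS)
print(by_pattern == by_context)
```

On this toy data both rules keep the same element, in line with the remark that extraction is not affected; the context rule states the intent as one concept rather than as a set of path patterns.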
