Opened 8 years ago

Closed 8 years ago

#836 closed defect (fixed)

Support extraction of CMD attributes based on conceptlinks

Reported by: teckart@informatik.uni-leipzig.de
Owned by: teckart@informatik.uni-leipzig.de
Priority: major
Milestone: VLO-4.0
Component: VLO importer
Version:
Keywords:
Cc: Twan Goosen

Description


Change History (12)

comment:1 Changed 8 years ago by DefaultCC Plugin

Cc: Twan Goosen added

comment:2 Changed 8 years ago by teckart@informatik.uni-leipzig.de

A first test implementation (which treats attributes the same way as elements) is available and awaiting testing. It is still an open question whether we want more advanced logic such as "ignore an element's content if there are attributes with the same concept link" or "prefer the first concept link found in the facetMapping file", but I guess we are not there yet.
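The first of the two candidate rules could be sketched roughly as follows. This is only an illustration of the idea, not the importer's code: the function names, the `(xpath, concept, value)` tuple layout, and the sample paths and values are all assumptions made for this example.

```python
def is_attribute(xpath):
    """True if the last XPath step addresses an attribute (starts with '@')."""
    return xpath.rsplit("/", 1)[-1].startswith("@")


def drop_shadowed_elements(candidates):
    """Drop element values whose own attribute carries the same concept link.

    candidates: list of (xpath, concept, value) tuples (hypothetical layout).
    """
    # (parent element path, concept) pairs for which an attribute value exists
    shadowed = {
        (xpath.rsplit("/", 1)[0], concept)
        for xpath, concept, _ in candidates
        if is_attribute(xpath)
    }
    return [
        (xpath, concept, value)
        for xpath, concept, value in candidates
        if is_attribute(xpath) or (xpath, concept) not in shadowed
    ]


# Illustrative values only
candidates = [
    ("/CMD/Components/OLAC-DcmiTerms/language", "CCR_C-2482", "German"),
    ("/CMD/Components/OLAC-DcmiTerms/language/@olac-language", "CCR_C-2482", "de"),
    ("/CMD/Components/OLAC-DcmiTerms/title", "some-other-concept", "A title"),
]
kept = drop_shadowed_elements(candidates)
```

With this rule the element text of `language` would be skipped in favour of its attribute, while unrelated elements pass through unchanged.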

@Twan: could you add a concept link to "@olac-language" in clarin.eu:cr1:p_1288172614026? (probably: http://hdl.handle.net/11459/CCR_C-2482_08eded24-4086-7e3f-88e5-e0807fb01e17). AFAIK this shouldn't do any harm, as importer v3.3 ignores these attributes and according to CMD the link is missing anyway. Once it is changed I will start a test import on my local test machine and we can look at the consequences.

comment:3 Changed 8 years ago by teckart@informatik.uni-leipzig.de

Owner: changed from teckart@informatik.uni-leipzig.de to Twan Goosen
Status: new → assigned

comment:4 in reply to:  2 Changed 8 years ago by Twan Goosen

Owner: changed from Twan Goosen to teckart@informatik.uni-leipzig.de

Replying to teckart@…:

@Twan: could you add a concept link to "@olac-language" in clarin.eu:cr1:p_1288172614026? (probably: http://hdl.handle.net/11459/CCR_C-2482_08eded24-4086-7e3f-88e5-e0807fb01e17). AFAIK this shouldn't do any harm, as importer v3.3 ignores these attributes and according to CMD the link is missing anyway. Once it is changed I will start a test import on my local test machine and we can look at the consequences.

Done (see clarin.eu:cr1:p_1288172614026).

comment:5 Changed 8 years ago by teckart@informatik.uni-leipzig.de

Milestone: VLO-4.0 or later

comment:6 Changed 8 years ago by teckart@informatik.uni-leipzig.de

Support added with e3b762473. This shouldn't change a lot as relevant attributes are included in the facet definition via fallback patterns.

For active use (and a cleanup of the facetconcepts file) some components have to be modified (i.e. conceptlinks have to be added). The only apparent change is that language/@olac-language for OLAC-DcmiTerms (clarin.eu:cr1:p_1288172614026) is now extracted based on its conceptlink (and its pattern has been removed). There may be new values extracted from attributes that are currently not added via patterns, but a test import didn't show noise in any of the facets.
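The relationship between concept links and fallback patterns described here can be pictured with a small sketch. It assumes that patterns are consulted only when no concept link of the facet matches in a profile; the class, the function, and the example XPaths are hypothetical and do not reflect the importer's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class FacetDefinition:
    """Hypothetical stand-in for one facet entry of the facetconcepts file."""
    name: str
    concepts: list = field(default_factory=list)   # concept link URIs
    patterns: list = field(default_factory=list)   # fallback XPath patterns


def xpaths_for_facet(facet, concept_to_xpaths):
    """Return XPaths resolved via concept links, else the fallback patterns."""
    xpaths = []
    for concept in facet.concepts:
        xpaths.extend(concept_to_xpaths.get(concept, []))
    # Assumption: patterns only apply when no concept link matched
    return xpaths or list(facet.patterns)


language = FacetDefinition(
    name="language",
    concepts=["http://hdl.handle.net/11459/CCR_C-2482_08eded24-4086-7e3f-88e5-e0807fb01e17"],
    patterns=["/cmd:CMD/cmd:Components/cmd:Example/cmd:language/text()"],  # made-up pattern
)

# Concept links found in one profile (illustrative XPath)
profile_links = {
    "http://hdl.handle.net/11459/CCR_C-2482_08eded24-4086-7e3f-88e5-e0807fb01e17": [
        "/cmd:CMD/cmd:Components/cmd:OLAC-DcmiTerms/cmd:language/@olac-language"
    ]
}

via_concept = xpaths_for_facet(language, profile_links)
via_pattern = xpaths_for_facet(language, {})
```

Once an attribute has a concept link in its profile, its dedicated pattern becomes redundant under this model, which is why the pattern for language/@olac-language could be dropped.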

Nonetheless, a more systematic analysis of the fallback patterns should take place before the 4.0 release; leaving this ticket open.

comment:7 Changed 8 years ago by Twan Goosen

Milestone: VLO-4.0 or later → VLO-4.1 or later

Milestone renamed

comment:8 Changed 8 years ago by Twan Goosen

Milestone: VLO-4.1 or later → VLO-4.0

comment:9 Changed 8 years ago by davor.ostojic@oeaw.ac.at

To extract values from attributes for the "teiHeader" profile, the XPath
"./xs:complexType/xs:simpleContent/xs:extension/xs:attribute" is not sufficient;
"./complexType/attribute" is needed as well.

Unfortunately I couldn't find the problematic CMD record, but it was related to the availability facet and the
"status" attribute in the XSD; look for:
<xs:attribute name="status" dcr:datcat="http://hdl.handle.net/11459/CCR_C-5303_9afa7d0c-f292-8e89-24f8-997a3d2971ae">
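A minimal sketch of why both paths are needed, using Python's standard `xml.etree.ElementTree`. The sample schema is invented for illustration (in particular, the element name `availability` and the overall structure are assumptions, not copied from the real profiles), and the `dcr` namespace URI is assumed to be the one used in CMDI 1.1 profile schemas.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
NS = {"xs": XS}
# Assumed namespace of the dcr:datcat annotation
DATCAT = "{http://www.isocat.org/ns/dcr}datcat"

# Attributes declared on a CMD element (simple content with extension)
ELEMENT_ATTRS = "./xs:complexType/xs:simpleContent/xs:extension/xs:attribute"
# Attributes declared directly on a CMD component
COMPONENT_ATTRS = "./xs:complexType/xs:attribute"

# Hypothetical, heavily reduced profile schema
SAMPLE_XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dcr="http://www.isocat.org/ns/dcr">
  <xs:element name="language">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute name="olac-language"
              dcr:datcat="http://hdl.handle.net/11459/CCR_C-2482_08eded24-4086-7e3f-88e5-e0807fb01e17"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
  <xs:element name="availability">
    <xs:complexType>
      <xs:sequence/>
      <xs:attribute name="status"
          dcr:datcat="http://hdl.handle.net/11459/CCR_C-5303_9afa7d0c-f292-8e89-24f8-997a3d2971ae"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def attribute_concept_links(xsd_text):
    """Map 'element/@attribute' to its concept link, checking both XPaths."""
    root = ET.fromstring(xsd_text)
    links = {}
    for element in root.iter(f"{{{XS}}}element"):
        for path in (ELEMENT_ATTRS, COMPONENT_ATTRS):
            for attr in element.findall(path, NS):
                concept = attr.get(DATCAT)
                if concept:
                    links[f"{element.get('name')}/@{attr.get('name')}"] = concept
    return links


links = attribute_concept_links(SAMPLE_XSD)
for path, concept in links.items():
    print(path, "->", concept)
```

With only the first path, the `status` attribute in this sample would not be found, because it sits directly under `xs:complexType` rather than under `xs:simpleContent/xs:extension`.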

comment:10 Changed 8 years ago by teckart@informatik.uni-leipzig.de

Thanks for the hint regarding profile "clarin.eu:cr1:p_1380106710826". The current solution only works for attributes on CMD elements. "./complexType/attribute" is needed for component attributes.

Example: CLARIN_DK_UCPH_Repository/oai_clarin_dk_dkclarin_309874.xml (for CCR_C-5303 which is currently not part of the facet mapping)

Last edited 8 years ago by teckart@informatik.uni-leipzig.de

comment:11 Changed 8 years ago by teckart@informatik.uni-leipzig.de

Support for component attributes added with 5b1843b.

comment:12 Changed 8 years ago by teckart@informatik.uni-leipzig.de

Resolution: fixed
Status: assigned → closed

Evaluation showed that 14 XPaths are currently extracted based on attribute concept links. Of those, two contain potentially relevant information for the language facet (OLAC-DcmiTerms + teiHeader). The other 12 are now part of the blacklist (4602b98 + 41fcbf4).
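The effect of the blacklist can be illustrated with a short sketch. It assumes a blacklist of exact XPath strings; the importer's real matching rules may differ, and the function name and sample XPaths are made up for this example.

```python
def filter_blacklisted(concept_xpaths, blacklist):
    """Keep only XPaths that are not on the blacklist (exact match assumed)."""
    blocked = set(blacklist)
    return [xpath for xpath in concept_xpaths if xpath not in blocked]


# Illustrative XPaths only
extracted = [
    "/cmd:CMD/cmd:Components/cmd:OLAC-DcmiTerms/cmd:language/@olac-language",
    "/cmd:CMD/cmd:Components/cmd:Example/cmd:note/@lang",
]
blacklist = ["/cmd:CMD/cmd:Components/cmd:Example/cmd:note/@lang"]

remaining = filter_blacklisted(extracted, blacklist)
```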

Closing the ticket.
