Opened 6 years ago

Last modified 6 years ago

#1044 new enhancement

Fill up the value mapping for profileName to resourceClass

Reported by: matej.durco@oeaw.ac.at Owned by: matej.durco@oeaw.ac.at
Priority: major Milestone:
Component: MetadataCuration Version:
Keywords: Cc: haaf@bbaw.de, go.sugimoto@oeaw.ac.at, hanna.hedeland@uni-hamburg.de, Caspar Jordan, matej.durco@oeaw.ac.at, Menzo Windhouwer

Description (last modified by matej.durco@oeaw.ac.at)

As agreed in the Vienna meeting, we want to improve the poor coverage of the resourceClass facet. Next to normalizing the values within the facet (#1042), we want to map profileNames to resourceClass as well, as profile names often implicitly carry the resource type.

There is a corresponding value-mapping list compiled in the VLO-mapping repo on GitHub.
Actually there are two:

Both lists are merged with the partial mappings from the Vienna meeting.
They also record how many instances of each profile are found in the VLO (snapshot from the beginning of 2018-02) and which collections mainly contribute the given values.
In addition, there are more TF-* fields for curators' comments.

We take the latter, shorter list: profileName2resourceClass_tf-extended_noResourceClassProfiles.csv
and try to fill it up. Only propose a mapped/derived value where there is no ambiguity or uncertainty. Some profiles are generic and cover various resource types. For those, the target column needs to stay empty and they should be marked not-applicable in the TF-applicability field, so that we know they have been reviewed but are not considered in the mapping.
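The rule above (map only unambiguous profiles, skip those marked not-applicable) could be applied to the list roughly as follows. This is a minimal sketch, not the actual VLO import code: the column names profileName and resourceClass, the semicolon delimiter, and the sample values are assumptions for illustration; only TF-applicability and not-applicable are taken from this ticket.

```python
import csv
import io


def build_mapping(csv_text, delimiter=";"):
    """Return {profileName: resourceClass} for rows with a usable target value.

    Column names and the delimiter are assumptions, not the real file layout.
    """
    mapping = {}
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        profile = (row.get("profileName") or "").strip()
        target = (row.get("resourceClass") or "").strip()
        applicability = (row.get("TF-applicability") or "").strip()
        # Generic profiles are marked not-applicable and keep an empty target;
        # rows that nobody has filled in yet are skipped as well.
        if not profile or not target or applicability == "not-applicable":
            continue
        mapping[profile] = target
    return mapping


# Hypothetical sample rows, for illustration only.
SAMPLE = (
    "profileName;resourceClass;TF-applicability\n"
    "TextProfile;text;\n"
    "GenericProfile;;not-applicable\n"
    "UnreviewedProfile;;\n"
)

print(build_mapping(SAMPLE))
```

Rows with an empty target are left out whether or not they have been reviewed, so the TF-applicability marker only serves to document that a curator looked at the profile.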

The file is sliced and divided as follows:

Go 1 - 15
Jakob 16 - 30
Susanne 31 - 45
Hanna 46 - 60
Caspar 61 - 75
Matej 76 - 90

The profileName facet (_componentProfile) is not visible in the VLO, so to find and evaluate the records there you need to write the corresponding query by hand, e.g. _componentProfile:TextProfile
https://vlo.clarin.eu/?q=_componentProfile:TextProfile
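For checking many profiles in a row, the query links can be generated instead of typed. The sketch below assumes that the VLO front page accepts the query in a q URL parameter; the helper name profile_query_url is made up for this example.

```python
from urllib.parse import urlencode

VLO_BASE = "https://vlo.clarin.eu/"


def profile_query_url(profile_name, base=VLO_BASE):
    """Build a VLO search link for one profile name (assumes a 'q' parameter)."""
    query = "_componentProfile:" + profile_name
    # Keep the colon readable; other special characters are percent-encoded.
    return base + "?" + urlencode({"q": query}, safe=":")


print(profile_query_url("TextProfile"))
# prints https://vlo.clarin.eu/?q=_componentProfile:TextProfile
```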

Change History (9)

comment:1 Changed 6 years ago by matej.durco@oeaw.ac.at

Description: modified (diff)
Type: changed from defect to enhancement

comment:2 Changed 6 years ago by matej.durco@oeaw.ac.at

Cc: haaf@bbaw.de go.sugimoto@oeaw.ac.at hanna.hedeland@uni-hamburg.de Caspar Jordan matej.durco@oeaw.ac.at Menzo Windhouwer added
Description: modified (diff)

comment:3 Changed 6 years ago by Twan Goosen

A test on alpha-vlo.clarin.eu with

shows a reduction of the number of records without any resourceClass value from 594k (production) to 69.7k (alpha).
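As a quick check of what these figures mean in relative terms, the reduction can be worked out as below. The inputs are the rounded numbers quoted above (594k and 69.7k), so the result is approximate.

```python
def reduction(before, after):
    """Return the absolute and the percentage reduction between two counts."""
    removed = before - after
    return removed, 100.0 * removed / before


# Rounded record counts from the comment above: production vs. alpha.
removed, percent = reduction(594_000, 69_700)
print(f"{removed} fewer records without resourceClass ({percent:.1f}% reduction)")
# prints 524300 fewer records without resourceClass (88.3% reduction)
```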

Note that the compiled map had to be adapted to account for the case difference in the field names in this experiment. This has to be fixed permanently; see pull request #22.


Remaining collections with >=10 records without an indicated resource class (indicated by count in parentheses):

Wolfenbuettel Digital Library: Deutsche Digitale Bibliothek (21413)
Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim (17413)
Wolfenbuettel Digital Library: Digitization on demand (5578)
Meertens Collection: Soundbites (4386)
A Digital Archive of Research Papers in Computational Linguistics (3280)
Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) (2521)
Meertens collection: Liederenbank (1568)
TLA: DiscAn (1457)
ORTOLANG Repository (1427)
CLARIN NL : VALID (1265)
Meertens collection: Etstoel (999)
Meertens Collection: Dynamische Fonologische en Morfologische Atlas van de Nederlandse Dialecten (GTRP) (613)
Meertens collections: PILNAR (583)
Wolfenbuettel Digital Library: Erschließung alchemiegeschichtlicher Quellen an der HAB (573)
Meertens Collection: GBA-derived - sub municipality (443)
Meertens Collection: Heiloo toponiemen (400)
AGD (399)
Meertens Collection: Diversity in Dutch DP Design (DiDDD) (333)
CLARIN Centres (327)
Meertens Collection: Dynamische Syntactische Atlas van de Nederlandse Dialecten (DynaSAND) (267)
Tübingen Language Resources (267)
Wolfenbuettel Digital Library: AEDit - Archiv-, Editions- und Distributionsplattform (179)
California Language Archive (168)
Center of Estonian Language Resources (130)
LINDAT / CLARIN Data & Tools (126)
CLARIN-PL (120)
Språkbanken (116)
CLARIN.SI data & tools (106)
COST elicitationdata (98)
Meertens collection: CRM (96)
Meertens Collections - NDA (95)
LRT + Open Submissions Data & Tools (80)
Fesli (56)
African Language Materials Archive (52)
Meertens Collection: GBA-derived - sub COROP area (40)
Archive of the Indigenous Languages of Latin America (39)
IULA UPF OAI Archive: Resources (30)
The Rosetta Project: A Long Now Foundation Library of Human Language (28)
Meertens Collection: GBA-derived - sub dialectarea (24)
ILC4CLARIN : ILC Data & Tools (18)
CLARINO Bergen Centre (17)
Meertens Collection: GBA-derived - sub province (12)
MPI corpora : Neurogenetics of Vocal Communication (11)
Wolfenbuettel Digital Library (10) 
Last edited 6 years ago by Twan Goosen

comment:4 Changed 6 years ago by haaf@bbaw.de

I added some changes to the document which should have an effect on the mapping of the following profiles:

  • Meertens Collection: GBA-derived - sub municipality (443)
  • Meertens Collection: GBA-derived - sub province (12)
  • Meertens Collection: GBA-derived - sub dialectarea (24)
  • Meertens Collection: GBA-derived - sub COROP area (40)

comment:5 in reply to:  4 ; Changed 6 years ago by Twan Goosen

Replying to haaf@…:

I added some changes to the document which should have an effect on the mapping of the following profiles:
(..)

This relates to pull request #20 I assume

comment:6 in reply to:  5 ; Changed 6 years ago by haaf@bbaw.de

Replying to twan@…:

Replying to haaf@…:

I added some changes to the document which should have an effect on the mapping of the following profiles:
(..)

This relates to pull request #20 I assume

Yes, it does, indeed.

comment:7 in reply to:  6 Changed 6 years ago by Twan Goosen

Replying to haaf@…:

Replying to twan@…:

Replying to haaf@…:

I added some changes to the document which should have an effect on the mapping of the following profiles:
(..)

This relates to pull request #20 I assume

Yes, it does, indeed.

I have merged this branch into the development branch of the clarin-eric/VLO-mapping repository (of which acdh-oeaw/VLO-mapping is a fork).

This is included in pull request #23. Would it be ok with you to close #20?

BTW, in case this is not entirely clear, the network diagram (scroll to the right) illustrates the current state.

comment:8 Changed 6 years ago by matej.durco@oeaw.ac.at

There is an older patch by Jakob,
but it has some errors in it (e.g. comma instead of semicolon)
and cannot be merged automatically.

Can you help me out here: what would be the right git way to handle this?
Otherwise I would simply transfer the changes manually to a new branch.
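Errors such as a comma in place of a semicolon can be spotted before merging by checking that every row has as many fields as the header. The following is only a sketch: it assumes the mapping file is semicolon-delimited with a header row, and the sample rows are invented.

```python
import csv
import io


def find_malformed_rows(csv_text, delimiter=";"):
    """Return (line_number, fields) for rows whose field count differs from the header."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    if not rows:
        return []
    expected = len(rows[0])
    return [
        (line_no, row)
        for line_no, row in enumerate(rows[1:], start=2)
        if row and len(row) != expected  # blank lines are ignored
    ]


# Hypothetical sample: the third line uses a comma instead of a semicolon.
SAMPLE = (
    "profileName;resourceClass;TF-applicability\n"
    "TextProfile;text;\n"
    "SongProfile,audio;\n"
)

for line_no, row in find_malformed_rows(SAMPLE):
    print(f"line {line_no}: unexpected number of fields: {row}")
```

The line numbers are only accurate as long as no field contains a quoted line break.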

comment:9 Changed 6 years ago by Twan Goosen

@matej I think the easiest approach would be to make the pull request and fix any issues while resolving conflicts (which you should be able to do within the GitHub interface).
