Opened 6 years ago
Last modified 6 years ago
#1044 new enhancement
Fill up the value mapping for profileName to resourceClass
Reported by: | matej.durco@oeaw.ac.at | Owned by: | matej.durco@oeaw.ac.at |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | MetadataCuration | Version: | |
Keywords: | Cc: | haaf@bbaw.de, go.sugimoto@oeaw.ac.at, hanna.hedeland@uni-hamburg.de, Caspar Jordan, matej.durco@oeaw.ac.at, Menzo Windhouwer |
Description (last modified by )
As agreed in the Vienna meeting, to improve the poor coverage of the resourceClass facet, next to normalizing the values within the facet (#1042), we want to map profileNames to resourceClass as well, as they often implicitly carry the resource type in them.
There is a corresponding value-mapping list compiled in the VLO-mapping repo on GitHub.
Actually there are two:
- profileName2resourceClass_tf-extended_allProfiles.csv |111| - contains all profiles with records in VLO
- profileName2resourceClass_tf-extended_noResourceClassProfiles.csv |90| - only profiles that have records in VLO but whose resourceClass facet is empty.
Both lists are merged with the partial mappings from the Vienna meeting.
They also record how many instances of each profile are found in the VLO (snapshot from the beginning of 2018-02) and which collections mainly contribute the given values.
In addition, they have further TF-* fields for curators' comments.
We take the latter, shorter list, profileName2resourceClass_tf-extended_noResourceClassProfiles.csv, and try to fill it up. Only propose a mapped/derived value where there is no ambiguity or uncertainty. Some profiles are generic, covering various resource types. For those the target column needs to stay empty and they should be marked in the field TF-applicability with not-applicable, so that we know that we looked at them, but they are not considered in the mapping.
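The rule above can be sketched roughly as follows. This is only an illustration, assuming the list is a semicolon-separated CSV; the column names `profileName` and `resourceClass`, the profile `GenericProfile`, the value `Text` and the helper `fill_row` are hypothetical. Only `TF-applicability`, `not-applicable` and `TextProfile` are taken from this ticket.

```python
import csv
import io

# Hypothetical excerpt of the list; the real column names and rows may differ.
SAMPLE = """profileName;resourceClass;TF-applicability
TextProfile;;
GenericProfile;;
"""

# Curator decisions: a resource class where the profile is unambiguous,
# None where the profile is generic and covers various resource types.
DECISIONS = {"TextProfile": "Text", "GenericProfile": None}


def fill_row(row, decisions):
    """Fill one row in place according to the rule in the description."""
    name = row["profileName"]
    if name not in decisions:
        return row  # not looked at yet: leave the row untouched
    value = decisions[name]
    if value is None:
        row["resourceClass"] = ""  # target column stays empty
        row["TF-applicability"] = "not-applicable"  # but we record that we looked
    else:
        row["resourceClass"] = value
    return row


rows = [fill_row(r, DECISIONS)
        for r in csv.DictReader(io.StringIO(SAMPLE), delimiter=";")]
for r in rows:
    print(r)
```

Keeping "not yet looked at" (row untouched) distinct from "looked at, not applicable" is what makes the marker useful later on.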
The file is sliced and divided as follows:
Go | 1 - 15 |
Jakob | 16 - 30 |
Susanne | 31 - 45 |
Hanna | 46 - 60 |
Caspar | 61 - 75 |
Matej | 76 - 90 |
The profileName facet (_componentProfile) is not made visible in the VLO, so for finding and evaluating the records in the VLO you need to write a corresponding query by hand, e.g. _componentProfile:TextProfile (https://vlo.clarin.eu/?q=_componentProfile:TextProfile).
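For checking many profiles, such hand-written queries could be generated with a small helper. This is a sketch under assumptions: it presumes the `/?q=` form seen in this ticket applies to https://vlo.clarin.eu, and that profile names containing spaces or colons have to be wrapped in double quotes to be read as one phrase. The function `profile_query_url` is hypothetical and not part of the VLO.

```python
from urllib.parse import urlencode

VLO_BASE = "https://vlo.clarin.eu/"


def profile_query_url(profile_name, base=VLO_BASE):
    """Build a search URL that filters on the hidden _componentProfile field."""
    # Assumption: values with spaces or colons must be quoted as a phrase.
    needs_quotes = any(ch in profile_name for ch in " :")
    value = f'"{profile_name}"' if needs_quotes else profile_name
    # safe=":" keeps the field separator readable instead of encoding it as %3A.
    return base + "?" + urlencode({"q": f"_componentProfile:{value}"}, safe=":")


print(profile_query_url("TextProfile"))
```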
Change History (9)
comment:1 Changed 6 years ago by
Description: | modified (diff) |
---|---|
Type: | defect → enhancement |
comment:2 Changed 6 years ago by
Cc: | haaf@bbaw.de go.sugimoto@oeaw.ac.at hanna.hedeland@uni-hamburg.de Caspar Jordan matej.durco@oeaw.ac.at Menzo Windhouwer added |
---|---|
Description: | modified (diff) |
comment:4 follow-up: 5 Changed 6 years ago by
I added some changes to the document which should have an effect on the mapping of the following profiles:
- Meertens Collection: GBA-derived - sub municipality (443)
- Meertens Collection: GBA-derived - sub province (12)
- Meertens Collection: GBA-derived - sub dialectarea (24)
- Meertens Collection: GBA-derived - sub COROP area (40)
comment:5 follow-up: 6 Changed 6 years ago by
comment:6 follow-up: 7 Changed 6 years ago by
comment:7 Changed 6 years ago by
Replying to haaf@…:
Replying to twan@…:
Replying to haaf@…:
I added some changes to the document which should have an effect on the mapping of the following profiles:
(..)
This relates to pull request #20, I assume.
Yes, it does, indeed.
I have merged this branch into the development branch of the clarin-eric/VLO-mapping repository (of which acdh-oeaw/VLO-mapping is a fork).
This is included in pull request #23. Would it be ok with you to close #20?
BTW, in case this is not entirely clear, the network diagram (scroll to the right) illustrates the current state.
comment:8 Changed 6 years ago by
There is an older patch by Jakob, but it has some errors in it (e.g. comma instead of semicolon) and cannot be merged automatically.
Can you help me out here: what would be the right git way to handle this?
Otherwise I would simply transfer the changes manually in a new branch.
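Delimiter slips like the one mentioned (comma instead of semicolon) could be caught before merging with a quick check. A minimal sketch, assuming the list is semicolon-separated with a header row; the function `find_delimiter_errors` and the sample rows are hypothetical.

```python
import csv
import io


def find_delimiter_errors(text, delimiter=";"):
    """Return (line_number, field_count) for rows whose field count differs from the header's."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # empty input: nothing to check
    expected = len(header)
    problems = []
    for row in reader:
        # Skip blank lines; flag rows that split into the wrong number of fields.
        if row and len(row) != expected:
            problems.append((reader.line_num, len(row)))
    return problems


# Hypothetical sample: the third line uses a comma instead of a semicolon.
SAMPLE = "profileName;resourceClass\nTextProfile;Text\nOtherProfile,Audio\n"
print(find_delimiter_errors(SAMPLE))
```

A row with a comma in place of the semicolon collapses into a single field, so it shows up with a field count lower than the header's.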
comment:9 Changed 6 years ago by
@matej I think the easiest approach would be to make the pull request and fix any issues while resolving conflicts (which you should be able to do within the GitHub interface).
A test on alpha-vlo.clarin.eu with the _componentProfile to resourceClass map shows a reduction of the number of records without any resourceClass value from 594k (production) to 69.7k (alpha).
Note that the compiled map had to be adapted to account for the case difference in the field names in this experiment. This has to be fixed permanently; see pull request #22.
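To put the reported counts into proportion, the relative reduction can be worked out as below. The counts are the rounded figures from this comment (594k and 69.7k), so the result is approximate; the helper `relative_reduction` is only illustrative.

```python
def relative_reduction(before, after):
    """Share of the original count that was eliminated."""
    return (before - after) / before


production = 594_000  # records without resourceClass on production (rounded)
alpha = 69_700        # records without resourceClass on alpha (rounded)

print(f"{relative_reduction(production, alpha):.1%}")  # prints 88.3%
```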
Remaining collections with >=10 records without an indicated resource class (indicated by count in parentheses):