Opened 10 years ago
Closed 10 years ago
#683 closed task (fixed)
employ the curated org vocab for the organisation facet
Reported by: | xnrn@gmx.net | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | VLO-3.1 |
Component: | VLO importer | Version: | |
Keywords: | | Cc: | Twan Goosen |
Description
Employ the (already curated) organisations vocabulary for the normalization of the organisation facet.
The data could be requested via the API,
but it seems much more efficient to fetch a dump (via OAI-PMH), convert it accordingly, and inject it either as a PostProcessor or into Solr's configuration (Solr synonyms).
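The "convert it accordingly" step for the Solr synonyms route could look roughly like the sketch below. It assumes the dump has already been reduced to a variant-to-canonical mapping; the function names `escape` and `to_synonym_lines` are hypothetical, and the backslash escaping of commas is an assumption about the synonym parser in use that should be checked against the deployed Solr version. Whether a synonym filter can be applied to the facet field at all depends on the field type in the schema.

```python
from collections import defaultdict


def escape(term: str) -> str:
    """Escape characters that would otherwise split a synonym entry.

    Assumption: the synonym parser honours backslash-escaped commas.
    """
    return term.replace("\\", "\\\\").replace(",", "\\,")


def to_synonym_lines(variant_to_canonical: dict) -> list:
    """Group variants by canonical name and emit 'a, b => canonical' lines."""
    grouped = defaultdict(set)
    for variant, canonical in variant_to_canonical.items():
        grouped[canonical].add(variant)

    lines = []
    for canonical in sorted(grouped):
        # Sort variants so that the generated file is stable between runs.
        left_side = ", ".join(escape(v) for v in sorted(grouped[canonical]))
        lines.append(f"{left_side} => {escape(canonical)}")
    return lines
```

Each returned line maps all known spellings of one organisation onto a single name, so a hand-made file for one entity and a file generated from a full dump would have the same shape.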
Attachments (1)
Change History (7)
comment:1 Changed 10 years ago by
Cc: | Twan Goosen added |
---|
comment:3 Changed 10 years ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
The solution can currently be previewed at the VLO dev instance in Leipzig.
However, the solution is unsatisfactory. Testing with the notorious "MPI, Nijmegen" case, we still find 18 variants containing "psycholing" that pertain to this very entity.
There is also a problem with the curation vocabulary itself: there are 3 Concepts/Organisations with "psycholing" in the name pertaining to MPI, Nijmegen, which together group 23 variations/spellings.
To add to the confusion, in CLAVAS itself there are 4 different Concepts/Organisations (see dump).
However, even perfectly matching variations don't seem to be resolved, e.g.:
Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands) (7)
Max Planck Institute for Psycholinguistics, Nijmegen, NL (87)
Another issue is that certain variants appear in the facet index multiple times, e.g.:
Max Planck Institute for Psycholinguistics (MPI)
The proposal is to test with a small hand-made vocabulary that contains only the (currently) encountered variants of MPI, Nijmegen and unifies them all into one entity.
Changed 10 years ago by
Attachment: | MPI_Nijmegen_variants.txt added |
---|
manually compiled synonyms-file just for all the encountered variants of MPI, Nijmegen
comment:4 Changed 10 years ago by
Postprocessing is now more tolerant of consecutive whitespace in organisation names (r6000). Problems that still remain (based on "MPI for Psycholinguistics") are:
- missing variants in CLAVAS (but included in Matej's variants file)
- varying mappings for the same organisation in CLAVAS ("MPI for Psycholinguistics" -> "MPI for Psycholinguistics, MPG"; "Max Planck Institute for Psycholinguistics." -> "Max Planck Institute for Psycholinguistics")
- wrong variants in CLAVAS ("MPI" -> "MPI for Psycholinguistics, MPG")
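The whitespace tolerance and one of the remaining vocabulary problems can be illustrated with a minimal sketch. It is not the importer's actual code (which is not shown in this ticket); `normalise_key`, `build_lookup`, and `resolve` are hypothetical names. The conflict check only catches the case where the same (whitespace-normalised) variant is mapped to two different targets. It does not detect that two different targets denote the same organisation.

```python
import re


def normalise_key(name: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", name).strip()


def build_lookup(pairs):
    """Build a variant -> canonical lookup and collect conflicting mappings.

    `pairs` is an iterable of (variant, canonical) tuples. The first
    mapping seen for a key wins; later disagreeing targets are recorded.
    """
    lookup = {}
    conflicts = {}
    for variant, canonical in pairs:
        key = normalise_key(variant)
        previous = lookup.setdefault(key, canonical)
        if previous != canonical:
            conflicts.setdefault(key, {previous}).add(canonical)
    return lookup, conflicts


def resolve(name: str, lookup: dict) -> str:
    """Return the canonical name, or the input unchanged if it is unknown."""
    return lookup.get(normalise_key(name), name)
```

Returning unknown names unchanged means that variants missing from the vocabulary still show up in the facet as they are, which matches the behaviour described above for the missing CLAVAS entries.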
comment:5 Changed 10 years ago by
Some corrections in the variants file (r6020), which should fix the problems for "MPI Nijmegen" (etc.).
If there are no objections I will close this ticket as the general approach has proven to be useful.
There are still many variant entries missing, but this is not directly a VLO importer problem; it is more a problem of completeness and quality of the CLAVAS vocabulary.
comment:6 Changed 10 years ago by
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Closing ticket as promised.