Opened 10 years ago

Closed 10 years ago

#683 closed task (fixed)

employ the curated org vocab for the organisation facet

Reported by: xnrn@gmx.net
Owned by:
Priority: major
Milestone: VLO-3.1
Component: VLO importer
Version:
Keywords:
Cc: Twan Goosen

Description

Employ the (already curated) organisations vocabulary to normalise the organisation facet.

The data could either be requested via the API, but it seems much more efficient to fetch a dump (via OAI-PMH), convert it accordingly, and inject it either as a PostProcessor or into Solr's configuration (Solr synonyms).
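
As a rough sketch of the Solr synonyms option, a converted dump could be written out in Solr's explicit-mapping syntax (`variant1, variant2 => canonical`). The `vocab` structure, the function names, and the choice of canonical name are assumptions for illustration and do not reflect the actual CLAVAS dump format. Commas inside names are backslash-escaped because the synonyms format uses the comma as a term separator.

```python
def escape(term: str) -> str:
    # Escape backslashes first, then commas (the term separator in a synonyms file).
    return term.replace("\\", "\\\\").replace(",", "\\,")


def to_synonym_lines(vocab: dict[str, list[str]]) -> list[str]:
    """Turn {canonical: [variants]} (hypothetical structure) into synonym lines."""
    lines = []
    for canonical, variants in sorted(vocab.items()):
        if not variants:
            continue  # nothing to map for this entry
        left = ", ".join(escape(v) for v in variants)
        lines.append(f"{left} => {escape(canonical)}")
    return lines


# Example data taken from names mentioned in this ticket; the canonical
# form chosen here is an assumption.
vocab = {
    "Max Planck Institute for Psycholinguistics": [
        "MPI Nijmegen",
        "Max Planck Institute for Psycholinguistics, Nijmegen, NL",
    ],
}

for line in to_synonym_lines(vocab):
    print(line)
```

Whether a synonym filter can be applied at all depends on how the organisation facet field is analysed in the Solr schema, which this ticket does not establish.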

Attachments (1)

MPI_Nijmegen_variants.txt (3.1 KB) - added by xnrn@gmx.net 10 years ago.
manually compiled synonyms-file just for all the encountered variants of MPI, Nijmegen


Change History (7)

comment:1 Changed 10 years ago by DefaultCC Plugin

Cc: Twan Goosen added

comment:2 Changed 10 years ago by teckart@informatik.uni-leipzig.de

Milestone: VLO-3.1
Resolution: fixed
Status: new → closed

Added to trunk (r5887) using the OrganisationPostProcessor and a dump of the current state of the vocabulary.

comment:3 Changed 10 years ago by xnrn@gmx.net

Resolution: fixed
Status: closed → reopened

The solution can currently be previewed at the Leipzig VLO dev instance.

However, the solution is unsatisfactory. Testing with the notorious "MPI, Nijmegen" case, we still find 18 variants containing "psycholing" that pertain to this very entity.

There is also a problem with the curation vocabulary itself: there are 3 Concepts/Organisations with "psycholing" in the name pertaining to MPI, Nijmegen, which group 23 variations/spellings.
To add to the confusion, in CLAVAS itself there are 4 different Concepts/Organisations (see dump).

However, even perfectly matching variations don't seem to be resolved, e.g.:

Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands) (7)
Max Planck Institute for Psycholinguistics, Nijmegen, NL (87)

Another issue is that certain variants appear in the facet index multiple times, e.g.:

Max Planck Institute for Psycholinguistics (MPI)

The proposal is to test with a small hand-made vocabulary that contains only the (currently) encountered variants of MPI, Nijmegen and unifies them all into one entity.
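
A minimal sketch of such a test, assuming a plain variant-to-entity lookup: every listed variant points at one canonical name, and the check is that the variants collapse to a single facet value. The canonical name, the helper names, and the unrelated value "Some Other Institute" are hypothetical; the attached variants file may use a different format.

```python
# Assumed canonical name for the single target entity.
CANONICAL = "Max Planck Institute for Psycholinguistics"

# Variants quoted in this comment; a real test would load the full list.
VARIANTS = [
    "Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands)",
    "Max Planck Institute for Psycholinguistics, Nijmegen, NL",
    "Max Planck Institute for Psycholinguistics (MPI)",
]

# Every variant maps to the same entity.
LOOKUP = {variant: CANONICAL for variant in VARIANTS}


def normalise(value: str) -> str:
    # Values that are not in the vocabulary pass through unchanged.
    return LOOKUP.get(value, value)


def distinct_after_normalisation(values):
    # The set of facet values that would remain after post-processing.
    return sorted({normalise(v) for v in values})


facet_values = VARIANTS + ["Some Other Institute"]
print(distinct_after_normalisation(facet_values))
```

If more than one value containing "psycholing" survives this step, either the vocabulary is missing a variant or the lookup is not matching it exactly.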

Changed 10 years ago by xnrn@gmx.net

Attachment: MPI_Nijmegen_variants.txt added

manually compiled synonyms-file just for all the encountered variants of MPI, Nijmegen

comment:4 Changed 10 years ago by teckart@informatik.uni-leipzig.de

Postprocessing is now more tolerant of consecutive whitespace in organisation names (r6000). Problems that still remain (based on "MPI for Psycholinguistics") are:
- missing variants in CLAVAS (but included in Matej's variants file)
- varying mappings for the same organisation in CLAVAS ("MPI for Psycholinguistics" -> "MPI for Psycholinguistics, MPG"; "Max Planck Institute for Psycholinguistics." -> "Max Planck Institute for Psycholinguistics")
- wrong variants in CLAVAS ("MPI" -> "MPI for Psycholinguistics, MPG")
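
The whitespace tolerance can be illustrated roughly as follows. This is a sketch of one possible approach, not the code committed in r6000: both the vocabulary variants and the incoming facet values are reduced to a key with runs of whitespace collapsed before the lookup. The function names are hypothetical.

```python
import re


def key(name: str) -> str:
    # Collapse any run of whitespace to a single space and trim the ends.
    return re.sub(r"\s+", " ", name).strip()


def build_lookup(pairs):
    # pairs: iterable of (variant, canonical) tuples from the vocabulary.
    return {key(variant): canonical for variant, canonical in pairs}


def resolve(lookup, value):
    # Fall back to the original value when no variant matches.
    return lookup.get(key(value), value)


# Mapping quoted from the list above.
lookup = build_lookup([
    ("MPI for Psycholinguistics", "MPI for Psycholinguistics, MPG"),
])

print(resolve(lookup, "MPI  for   Psycholinguistics"))
```

Normalising the key on both sides means a stray double space in either the vocabulary or the metadata no longer prevents a match, while unmatched names are left untouched.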


comment:5 Changed 10 years ago by teckart@informatik.uni-leipzig.de

Some corrections in the variants file (r6020), which should fix the problems for "MPI Nijmegen" (etc.).

If there are no objections I will close this ticket as the general approach has proven to be useful.
There are still many variant entries missing, but this is not directly a VLO importer problem; it is rather a matter of the completeness and quality of the CLAVAS vocabulary.

comment:6 Changed 10 years ago by teckart@informatik.uni-leipzig.de

Resolution: fixed
Status: reopened → closed

Closing ticket as promised.
