Notes on the re-implementation of the VLO
inspiration sources
- VLO 1.0: http://ems06.mpi.nl/cgi-bin/flamenco.cgi/flamh/Flamenco
- OLAC facet browser: http://syslsl01.library.upenn.edu/dla/olac/index.html
technology to be used
- Solr for the back-end
- Wicket for the front-end
- CMDI for the input
data sources
- The CMDI'zed IMDI corpus: http://www.mpi.nl/imdi/documents/cmdi-20100924.tar.gz
- the CMDI files harvested via OAI-PMH: http://www.mpi.nl/imdi/documents/olac-cmdi-20101214.tar.gz
- CMDI version of the LRT inventory: http://www.mpi.nl/imdi/documents/lrt-20101201.tar.gz
requirements
- initially, the same functionality as VLO 1.0
- using CMDI as direct input
- being reliable and debuggable
- using standardized, well-maintained libraries
- a button to report errors spotted in the metadata
- link the ISO-639-3 language codes to http://www.clarin.eu/external/language.php?code=eng so that users can get more information about a language wherever it is shown
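The language-link requirement can be sketched as a small helper. This is only an illustration: the function name `language_info_link` and the validation rule (three ASCII letters) are assumptions, and only the base URL and the `code` parameter come from the requirement itself.

```python
from urllib.parse import urlencode

# Base URL taken from the requirement above.
LANGUAGE_INFO_URL = "http://www.clarin.eu/external/language.php"


def language_info_link(iso639_3_code: str) -> str:
    """Return the information-page URL for a three-letter ISO-639-3 code.

    Hypothetical helper: codes are trimmed and lower-cased, and anything
    that is not exactly three ASCII letters is rejected.
    """
    code = iso639_3_code.strip().lower()
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not an ISO-639-3 code: {iso639_3_code!r}")
    return f"{LANGUAGE_INFO_URL}?{urlencode({'code': code})}"


print(language_info_link("eng"))
```

Rejecting malformed codes early keeps broken links out of the result list, and such rejections could feed the error-report requirement.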
suggested mapping CMDI OLAC profile > VLO 2.0 fields/facets
/CMD/Components/OLAC-DcmiTerms/title -> name
/CMD/Components/OLAC-DcmiTerms/subject -> subject
  when multiple subject elements exist, prefer (in order of preference): [@dcterms-type="LCSH"]
  ignore (for now, until the numeric codes can be resolved): [@dcterms-type="DDC"] [@dcterms-type="LCC"]
/CMD/Components/OLAC-DcmiTerms/publisher -> organisation
/CMD/Header/MdSelfLink -> id
/CMD/Components/OLAC-DcmiTerms/language[@olac-language] -> language
/CMD/Header/MdSelfLink (URL after first ":") (via OAI-PMH) -> origin
/CMD/Components/OLAC-DcmiTerms/type[@olac-linguistic-type] -> genre
/CMD/Components/OLAC-DcmiTerms/description -> description
/CMD/Components/OLAC-DcmiTerms/identifier (if starting with http:// or hdl:) -> open in original context (now: IMDI browser)
/CMD/Components/OLAC-DcmiTerms/date -> year (new facet, extract yyyy from yyyy-mm-dd or yyyy-mm-ddThh:mm:ssZ or take over yyyy)
Some example dates (these won't be easy to curate into neat facets): <date>1985-87</date> <date>1988-90?</date> <date>Jan-Feb 1990</date> <date>Dec 1988-</date> <date>1988-?</date> <date>5/91</date> <date>16/9/90</date> <date>5/91</date> <date>6/4/1991</date>
/CMD/Components/OLAC-DcmiTerms/type[@dcterms-type="DCMIType"] -> resource type (new facet)
/CMD/Components/OLAC-DcmiTerms/spatial[@dcterms-type="ISO3166"] -> country
/CMD/Components/OLAC-DcmiTerms/coverage[@dcterms-type="ISO3166"] -> country
/CMD/Components/OLAC-DcmiTerms/format[@dcterms-type="IMT"] -> format (new facet, contains mime type)
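Two rules in the OLAC mapping need more than a plain copy: choosing among several subject elements and deriving the year from the date. A minimal sketch of both follows. The function names and the input shape (a list of `(dcterms-type, text)` pairs) are hypothetical, and falling back to the first remaining subject when no LCSH subject exists is an assumption, since the notes only name LCSH as preferred.

```python
import re
from typing import List, Optional, Tuple

# Schemes skipped for now because their values are numeric codes.
IGNORED_SUBJECT_SCHEMES = {"DDC", "LCC"}


def pick_subject(subjects: List[Tuple[Optional[str], str]]) -> Optional[str]:
    """Choose one subject from (dcterms-type, text) pairs.

    LCSH is preferred and DDC/LCC are skipped. Assumption: if no LCSH
    subject exists, the first remaining subject is used.
    """
    usable = [(scheme, text) for scheme, text in subjects
              if scheme not in IGNORED_SUBJECT_SCHEMES]
    for scheme, text in usable:
        if scheme == "LCSH":
            return text
    return usable[0][1] if usable else None


# Accepts yyyy, yyyy-mm-dd, or yyyy-mm-ddThh:mm:ssZ and nothing else.
_CLEAN_DATE = re.compile(r"(\d{4})(?:-\d{2}-\d{2}(?:T\d{2}:\d{2}:\d{2}Z)?)?")


def extract_year(date: str) -> Optional[str]:
    """Return the four-digit year of a clean date, or None if it needs curation."""
    match = _CLEAN_DATE.fullmatch(date.strip())
    return match.group(1) if match else None


for sample in ["2010-09-24", "2010-12-14T10:00:00Z", "1990", "1985-87", "5/91"]:
    print(sample, "->", extract_year(sample))
```

With this strict pattern, every one of the example dates listed above returns `None`, so they would stay out of the year facet until they are curated rather than being guessed at.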
suggested mapping CMDI LRT profile > VLO 2.0 fields/facets
/CMD/Components/LrtInventoryResource/LrtCommon/ResourceName -> name
/CMD/Header/MdSelfLink -> id
/CMD/Components/LrtInventoryResource/LrtCommon/Languages/ISO639/iso-639-3-code -> language (convert code to full language name)
clarin.eu -> origin (fixed value)
?? -> genre
/CMD/Components/LrtInventoryResource/LrtCommon/Description -> description
/CMD/Components/LrtInventoryResource/LrtCommon/MetadataLink (if not existing: ReferenceLink?) -> open in original context
/CMD/Components/LrtInventoryResource/LrtCommon/FinalizationYearResourceCreation -> year
/CMD/Components/LrtInventoryResource/LrtCommon/ResourceType -> resource type
/CMD/Components/LrtInventoryResource/LrtCommon/Countries/Country/code -> country (convert code to full country name)
/CMD/Components/LrtInventoryResource/LrtCommon/Format -> format
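The LRT mapping above can be expressed as a lookup table of paths and applied with the standard library XML parser. This is a sketch under stated assumptions: the sample record is invented, it declares no XML namespace (real CMDI files may declare one, which would require namespace-aware paths), the codes are not yet converted to full names, and the undecided entries (genre, open in original context) are left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

LRT = "Components/LrtInventoryResource/LrtCommon"

# Facet name -> element path relative to the <CMD> root, as listed above.
LRT_FIELD_PATHS = {
    "name": f"{LRT}/ResourceName",
    "id": "Header/MdSelfLink",
    "language": f"{LRT}/Languages/ISO639/iso-639-3-code",
    "description": f"{LRT}/Description",
    "year": f"{LRT}/FinalizationYearResourceCreation",
    "resource type": f"{LRT}/ResourceType",
    "country": f"{LRT}/Countries/Country/code",
    "format": f"{LRT}/Format",
}


def map_lrt_record(xml_text: str) -> Dict[str, Optional[str]]:
    """Extract the facet values of one LRT record; missing elements give None."""
    root = ET.fromstring(xml_text)
    fields: Dict[str, Optional[str]] = {"origin": "clarin.eu"}  # fixed value
    for facet, path in LRT_FIELD_PATHS.items():
        text = root.findtext(path)
        fields[facet] = (text or "").strip() or None
    return fields


# Invented record for illustration only, without a namespace declaration.
SAMPLE_RECORD = """
<CMD>
  <Header><MdSelfLink>hypothetical-id-001</MdSelfLink></Header>
  <Components><LrtInventoryResource><LrtCommon>
    <ResourceName>Example corpus</ResourceName>
    <Languages><ISO639><iso-639-3-code>nld</iso-639-3-code></ISO639></Languages>
    <Description>An invented record.</Description>
    <FinalizationYearResourceCreation>2009</FinalizationYearResourceCreation>
    <ResourceType>Written Corpus</ResourceType>
    <Countries><Country><code>NL</code></Country></Countries>
  </LrtCommon></LrtInventoryResource></Components>
</CMD>
"""

for facet, value in map_lrt_record(SAMPLE_RECORD).items():
    print(f"{facet}: {value}")
```

Keeping the paths in a table rather than in code makes it easy to adjust the mapping while it is still only a suggestion.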
short notes VLO meeting
- have result list next to facets
- result list: name + description
- link back to the original OLAC XML record / IMDI record
- post-process the harvested data before importing it into VLO: country, language, organisation
- extract the year on the fly
- experiment with SOV facets, derived from the language name
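A result list of name and description shown next to the facets corresponds to a single faceted Solr request. The sketch below only builds the request URL, using standard Solr query parameters (`q`, `rows`, `fl`, `fq`, `facet`, `facet.field`, `facet.mincount`, `wt`). The Solr location, the core name `vlo`, and the assumption that the index fields are named exactly like the facets in the mappings are all hypothetical.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# Hypothetical Solr location and core name.
SOLR_SELECT_URL = "http://localhost:8983/solr/vlo/select"

# Assumed index field names, taken from the facet names in the mappings.
FACET_FIELDS = ["language", "country", "organisation", "genre", "year"]


def facet_query_url(selected: Optional[Dict[str, str]] = None, rows: int = 10) -> str:
    """Build a select URL returning name + description plus facet counts.

    `selected` maps a facet field to the value the user clicked; each
    entry becomes a filter query (fq) on an exact phrase.
    """
    params = [
        ("q", "*:*"),
        ("rows", str(rows)),
        ("fl", "name,description"),   # result list: name + description
        ("facet", "true"),
        ("facet.mincount", "1"),      # hide facet values with no hits
        ("wt", "json"),
    ]
    params += [("facet.field", field) for field in FACET_FIELDS]
    for field, value in (selected or {}).items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'{field}:"{escaped}"'))
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(facet_query_url({"language": "Dutch"}))
```

Using `fq` for the selected facet values keeps the main query unchanged, so the counts of the remaining facets narrow down as the user drills in.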
language codes
- old SIL codes and their ISO-639-3 equivalent: http://www.ethnologue.com/14/show_language.asp?code=ach
- SIL codes list: http://tal.univ-paris3.fr/mkAlign/mka-online/LanguageCodes.html