= Notes on the re-implementation of the VLO =

== inspiration sources ==
 * VLO 1.0: http://ems06.mpi.nl/cgi-bin/flamenco.cgi/flamh/Flamenco
 * OLAC facet browser: http://syslsl01.library.upenn.edu/dla/olac/index.html

== technology to be used ==
 * SOLR for the backend
 * wicket for the front-end
 * CMDI for the input

== data sources ==
 * The CMDI'zed IMDI corpus: http://www.mpi.nl/imdi/documents/cmdi-20100924.tar.gz
 * the CMDI files harvested via OAI-PMH: http://www.mpi.nl/imdi/documents/olac-cmdi-20101214.tar.gz
 * CMDI version of the LRT inventory: http://www.mpi.nl/imdi/documents/lrt-20101201.tar.gz

== requirements ==
 * in first instance same functionality as VLO 1.0
 * using CMDI as direct input
 * being reliable and debuggable
 * using standardized well-maintained libraries
 * a button to report errors spotted in the metadata
 * link the ISO-639-3 language codes to http://www.clarin.eu/external/language.php?code=eng so that users can get more information about a used language at each point

== suggested mapping CMDI OLAC profile > VLO 2.0 fields/facets ==

=== Name ===
/CMD/Components/OLAC-!DcmiTerms/title -> name

=== Subject ===
Use:
 /CMD/Components/OLAC-!DcmiTerms/subject[@olac-linguistic-field] -> subject
 /CMD/Components/OLAC-!DcmiTerms/subject -> subject when [@dcterms-type="LCSH"]

ignore (for now, until we have a resolution solution for the numeric codes):
 /CMD/Components/OLAC-!DcmiTerms/subject when [@dcterms-type="DDC"]
 /CMD/Components/OLAC-!DcmiTerms/subject when [@dcterms-type="LCC"]

ignore (because it results in too much noise):
 /CMD/Components/OLAC-!DcmiTerms/subject

=== Organisation ===
/CMD/Components/OLAC-!DcmiTerms/publisher -> organisation

=== Id ===
/CMD/Header/MdSelfLink -> id

=== Language ===
/CMD/Components/OLAC-!DcmiTerms/language[@olac-language] -> language
/CMD/Components/OLAC-!DcmiTerms/subject[@olac-language] -> language

=== Origin ===
/CMD/Header/MdSelfLink (URL after first ":") (via OAI-PMH) -> origin

=== Genre ===
/CMD/Components/OLAC-!DcmiTerms/type[@olac-linguistic-type] -> genre

=== Description ===
/CMD/Components/OLAC-!DcmiTerms/description -> description

=== open in original context ===
/CMD/Components/OLAC-!DcmiTerms/identifier (if starting with !http:// or hdl:) -> open in original context (now: IMDI browser)

=== Year ===
/CMD/Components/OLAC-!DcmiTerms/date -> year (new facet, extract yyyy from yyyy-mm-dd or yyyy-mm-ddThh:mm:ssZ or take over yyyy)
{{{
Some example dates, this won't be easy to curate and put in nice little facets.
1985-87
1988-90?
Jan-Feb 1990
Dec 1988-
1988-?
5/91
16/9/90
5/91
6/4/1991
}}}

=== Resource Type ===
/CMD/Components/OLAC-!DcmiTerms/type[@dcterms-type="DCMIType"] -> resource type (new facet)

=== Country ===
/CMD/Components/OLAC-!DcmiTerms/spatial[@dcterms-type="ISO3166"] -> country
/CMD/Components/OLAC-!DcmiTerms/coverage[@dcterms-type="ISO3166"] -> country

=== Format ===
/CMD/Components/OLAC-!DcmiTerms/format[@dcterms-type="IMT"] -> format (new facet, contains mime type)

=== suggested mapping CMDI LRT profile > VLO 2.0 fields/facets ===
/CMD/Components/LrtInventoryResource/LrtCommon/ResourceName -> name
/CMD/Header/MdSelfLink -> id
/CMD/Components/LrtInventoryResource/LrtCommon/Languages/ISO639/iso-639-3-code -> language (convert code to full language name)
clarin.eu -> origin (fixed value)
?? -> genre
/CMD/Components/LrtInventoryResource/LrtCommon/Description -> description
/CMD/Components/LrtInventoryResource/LrtCommon/MetadataLink (if not existing: ReferenceLink) -> open in original context
/CMD/Components/LrtInventoryResource/LrtCommon/FinalizationYearResourceCreation -> year
/CMD/Components/LrtInventoryResource/LrtCommon/ResourceType -> resource type
/CMD/Components/LrtInventoryResource/LrtCommon/Countries/Country/code -> country (convert code to full country name)
/CMD/Components/LrtInventoryResource/LrtCommon/Format -> format

=== suggested mapping CMDI IMDI profile > VLO 2.0 fields/facets ===
see source:vlo/trunk/vlo_webapp/src/main/resources/importerConfig.xml

=== short notes VLO meeting ===
 * have result list next to facets
 * result list: name + description
 * link back to original OLAC xml record / imdi record
 * post-process the harvested data before importing it into VLO: country, language, organisation
 * extract the year on the fly
 * experiment with SOV facets, derived from the language name

=== language codes ===
 * old SIL codes and their ISO-639-3 equivalent: http://www.ethnologue.com/14/show_language.asp?code=ach
 * SIL codes list: http://tal.univ-paris3.fr/mkAlign/mka-online/LanguageCodes.html
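
Returning to the Year facet and the "extract the year on the fly" meeting note: the extraction could be sketched roughly as below. This is only a minimal sketch, not the importer's actual code. It first accepts the three well-formed shapes named in the mapping (yyyy, yyyy-mm-dd, yyyy-mm-ddThh:mm:ssZ). The lenient fallback, which takes the first standalone four-digit run starting with 1 or 2, is an assumption that these notes do not decide, and the names `YearExtractor` and `extractYear` are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; the class and method names are not taken from the VLO code base. */
final class YearExtractor {

    // The three shapes named in the mapping: yyyy, yyyy-mm-dd, yyyy-mm-ddThh:mm:ssZ.
    private static final Pattern STRICT =
            Pattern.compile("^(\\d{4})(?:-\\d{2}-\\d{2}(?:T\\d{2}:\\d{2}:\\d{2}Z)?)?$");

    // Assumption: for free-text dates, take the first standalone 4-digit run
    // that starts with 1 or 2 (so "1985-87" gives 1985, "5/91" gives nothing).
    private static final Pattern LENIENT = Pattern.compile("(?<!\\d)([12]\\d{3})(?!\\d)");

    private YearExtractor() {
    }

    /** Returns the year as a four-character string, or empty if none can be found. */
    static Optional<String> extractYear(String rawDate) {
        if (rawDate == null) {
            return Optional.empty();
        }
        String value = rawDate.trim();

        Matcher strict = STRICT.matcher(value);
        if (strict.matches()) {
            return Optional.of(strict.group(1));
        }

        Matcher lenient = LENIENT.matcher(value);
        if (lenient.find()) {
            return Optional.of(lenient.group(1));
        }

        // No four-digit year present, e.g. "5/91" or "16/9/90": leave the facet empty.
        return Optional.empty();
    }
}
```

With this sketch, `1985-87`, `Dec 1988-` and `6/4/1991` from the example list yield 1985, 1988 and 1991, while `5/91` and `16/9/90` stay out of the facet. Whether the first year of a range is the right facet value is still an open question.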