Changeset 3776 for SMC4LRT


Timestamp:
10/16/13 16:06:54 (11 years ago)
Author:
vronk
Message:

final layout cleaning; backup

Location:
SMC4LRT
Files:
18 edited

  • SMC4LRT/Outline.tex

    r3681 r3776  
    7676
    7777\listoffigures
    78 \listoftodos
    79 \begin{comment}
     78%\listoftodos
     79%\begin{comment}
    8080\input{chapters/Introduction}
    8181
     
    8383
    8484
    85 \input{chapters/Definitions}
    86 \end{comment}
     85
     86%\end{comment}
    8787\input{chapters/Data}
    8888
    89 \begin{comment}
     89%\begin{comment}
    9090
    9191\input{chapters/Infrastructure}
     
    9999\input{chapters/Conclusion}
    100100
    101 \end{comment}
     101%\end{comment}
    102102
    103103
     
    108108\appendix
    109109
    110 %\input{chapters/appendix}
     110\input{chapters/Definitions}
     111\input{chapters/appendix}
    111112
    112113
  • SMC4LRT/chapters/Conclusion.tex

    r3665 r3776  
    88% Dynamic integration of the information from the Relation Registry into the search interface and search processing.
    99
    10 A whole separate track is the effort to deliver the CMD data as \emph{Linked Open Data}, for which only the groundwork has been done by specifying the modelling of the data in RDF. Further steps are: setup of a processing workflow to apply the specified model and transform all the data (profiles and instances) into RDF, a server solution to host the data and allow querying it and finally, on top of it offer a web interface for the users to explore the dataset.
     10A whole separate track is the effort to deliver the CMD data as \emph{Linked Open Data}, for which only the groundwork has been done by specifying the modelling of the data in RDF. Further steps are: setting up a processing workflow to apply the specified model and transform all the data (profiles and instances) into RDF, a server solution to host the data and allow querying it, and, eventually, a web interface for users to explore the dataset.
    1111
    1212%Irrespective of the additional levels - the user wants and has to get to the resource. (not always) to the "original"
    13 And finally, a visualization tool for the schema level data of the discussed data collection was developed -- the \emph{SMC Browser}.
    14 Considering the feedback received until now from the colleagues in the community, it is already now a useful tool with high further potential. As detailed in \ref{smc-browser-extensions}, there is a number of features, that could enhance the functionality and usefulness of the tool: integrate with instance data to be able to directly see which profiles are effectively being used; allow set operations on subgraphs (like intersection and difference) to enable differential views; generalize the matching algorithm; enhance the tool to act as an independent visualization service, by accepting external graph data (from any domain).
     13And finally, a visualization tool for exploring the schema level data of the discussed data collection was developed -- the \emph{SMC Browser}. Considering the feedback received so far from colleagues in the community, it is already a useful tool with high further potential. As detailed in \ref{smc-browser-extensions}, there are a number of features that could enhance the functionality and usefulness of the tool: integrate with instance data to be able to directly see which profiles are effectively being used; allow set operations on subgraphs (like intersection and difference) to enable differential views; generalize the matching algorithm; enhance the tool to act as an independent visualization service, by accepting external graph data (from any domain).
    1514
    1615Within the CLARIN community a number of (permanent) tasks have been identified and corresponding task forces have been established,
    17 one of them being metadata curation. The results of this work represent a directly applicable groundwork for this ongoing effort.
     16one of them being metadata curation. The results of this work represent a directly applicable input for this ongoing effort.
    1817One particularly pressing aspect of the curation is the consolidation of the actual values in the CMD records, a topic explicitly treated in this work.
  • SMC4LRT/chapters/Data.tex

    r3681 r3776  
    1010The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN Component Metadata Infrastructure. (See \ref{def:CMDI} for information about the infrastructure. The XML-schema defining CMD -- the \xne{general-component-schema} -- is featured in appendix \ref{lst:cmd-schema}.)
    1111CMD is used to define the so-called \var{profiles}, constructed out of reusable \var{components} -- collections of metadata fields. The components can contain other components and they can be reused in multiple profiles. A profile itself is just a special kind of component (a subclass) with some additional administrative information.
    12 The actual core provision for semantic interoperability is the requirement, that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category\footnote{persistently referenceable concept definition} (cf. \ref{def:DCR}), thus
     12The actual core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category (cf. \ref{def:DCR}\footnote{in short: a persistently referenceable concept definition}), thus
    1313indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.
    1414
     
    1717While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.
    1818
    19 Once the profiles are defined they are transformed into a XML-Schema, that prescribes the structure of the instance records.
     19Once the profiles are defined, they are transformed into an XML Schema that prescribes the structure of the instance records.
    2020The generated schema also conveys, as annotations, the information about the referenced data categories.
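A minimal sketch of this validation step, using lxml; the file names are placeholders (the XSD standing in for the schema generated for a profile, the XML file for a CMD instance record).

    # Sketch only: validate a CMD instance against the XSD generated for its profile.
    # "profile.xsd" and "record.cmdi.xml" are placeholder file names.
    from lxml import etree

    schema = etree.XMLSchema(etree.parse("profile.xsd"))
    instance = etree.parse("record.cmdi.xml")

    if schema.validate(instance):
        print("record conforms to the profile schema")
    else:
        # error_log lists the violations of the prescribed structure
        for error in schema.error_log:
            print(error.line, error.message)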
    2121
     
    2424In the CR, 124 public Profiles and 696 Components are defined\footnote{All numbers are as of 2013-06 if not stated otherwise}. Table \ref{table:dev_profiles} shows the development of the CR and DCR population over time.
    2525
    26 Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services ) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 419 components and 1587 elements
    27 (when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}).
     26Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles prove the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles also differ considerably in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\concept{dublincore}, \concept{collection}, the set of \concept{Bamdes}-profiles) there are complex profiles with up to 10 levels (\concept{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 117 distinct components and 337 elements (or 419 components and 1587 elements when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \concept{Contact}) included by three other components (\concept{Project}, \concept{Institution}, \concept{Access}) will appear three times in the instantiated record.}).
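The expansion effect described in the footnote can be illustrated with a toy sketch; the component names echo the footnote's example, the element counts are made up.

    # Toy model of element expansion through component reuse (illustrative numbers).
    # Each component declares its own elements and the components it includes.
    components = {
        "Contact":       {"elements": 4, "includes": []},
        "Project":       {"elements": 3, "includes": ["Contact"]},
        "Institution":   {"elements": 5, "includes": ["Contact"]},
        "Access":        {"elements": 2, "includes": ["Contact"]},
        "corpusProfile": {"elements": 6, "includes": ["Project", "Institution", "Access"]},
    }

    def expanded_elements(name):
        """Element slots in the instantiated record; reused components count repeatedly."""
        component = components[name]
        return component["elements"] + sum(expanded_elements(c) for c in component["includes"])

    # "Contact" is pulled in three times, so its elements appear three times as well
    print(expanded_elements("corpusProfile"))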
    2827
    2928
     
    136135Next to CLARIN and CMDI, there is a host of related previous and concurrent work. In the following, we briefly introduce some formats and data collections established in the field and, where applicable, we also sketch the ties with CMDI and existing integration efforts.
    137136
    138 Some overview/survey works regarding existing formats are: The CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. And \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} putting the overwhelming amount of existing metadata standards into a systematic comprehensive overview analyzing the use of standards from four aspects: community, domain, function, and purpose. Though despite its aspiration on comprehensiveness it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE, TEI???
     137As for comprehensive overviews of formats and standards, the CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides an overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. And \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} puts the overwhelming amount of existing metadata standards into a systematic, comprehensive visual overview, analyzing the use of standards from four aspects: community, domain, function, and purpose. Though despite its aspiration to comprehensiveness, it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE.
    139138
    140139
     
    150149\end{description}
    151150
    152 Today, Dublin Core metadata terms is very widely spread. Thanks to its simplicity it is used as the common denominator in many applications, content management systems integrate Dublin Core to use in \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), it is default minimal description in content repositories (Fedora-commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
     151The DCMI terms format is very widely spread nowadays. Thanks to its simplicity it is used as the common denominator in many applications: content management systems embed Dublin Core in \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), and it is the default minimal description in content repositories (Fedora-commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
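A hedged sketch of harvesting such Dublin Core records over OAI-PMH with plain Python; the endpoint URL is a placeholder, only the OAI-PMH and unqualified Dublin Core namespaces are the standard ones.

    # Sketch: harvest unqualified Dublin Core records via OAI-PMH.
    # The endpoint is a placeholder; any OAI-PMH data provider would do.
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    url = "http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc"

    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)

    for record in tree.iter(OAI + "record"):
        titles = [e.text for e in record.iter(DC + "title")]
        publishers = [e.text for e in record.iter(DC + "publisher")]
        print(titles, publishers)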
    153152
    154153There are multiple possible serializations, in particular a mapping to RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
     
    160159\label{def:OLAC}
    161160
    162 \xne{OLAC Metadata}\furl{http://www.language-archives.org/}format \cite{Bird2001} is a application profile\cite{heery2000application}, of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community} providing a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.
    163 
    164 The OLAC schema \furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies, for domain specific semantic annotation (\code{linguistic-field, linguistic-type, language, role, discourse-type})
     161\xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001} is an application profile \cite{heery2000application} of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community}, providing a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.
     162
     163The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies for domain-specific semantic annotation (\code{linguistic-field}, \code{role}, \code{linguistic-type}, \code{language}, \code{discourse-type}).
    165164
    166165\begin{quotation}
     
    234233One point of criticism from the community was the fact that the META-SHARE infrastructure does not provide any interface to the outside world, such as an OAI-PMH endpoint.
    235234
    236 ? MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
     235%? MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
    237236
    238237
    239238\subsection{ELRA}
    240239
    241 European Language Resources Association\furl{http://elra.info} ELRA, offers a large collection of language resources, mostly under license for a fee, although some resources are available for free as well.
     240The European Language Resources Association (ELRA)\furl{http://elra.info} offers a large collection of language resources (over 1.100), with a focus on spoken resources, but also written, terminological and multimodal resources, mostly under license for a fee (although selected resources are available for free as well).
    242241The available datasets can be searched for via the ELRA Catalog\furl{http://catalog.elra.info/}.
    243242Additionally, ELRA runs the so-called \xne{Universal Catalog} -- a repository comprising information regarding Language Resources (LRs) identified all over the world.
     
    254253\subsection{LDC}
    255254
    256 Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/} is another provider of high quality curated language resources
    257 
     255The Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/}, hosted by the University of Pennsylvania, is another provider/aggregator of high-quality curated language resources. The data is provided for a fee; more than 650 resources have been made available since 1993. The catalog is freely accessible, and the metadata is additionally aggregated by the OLAC archives.
    258256
    259257\section{Formats and Collections in the World of Libraries}
    260 
    261 There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition implying rich experience and the fact, that almost all of the resources in the libraries are language resources. This argument gets even more relevant in the light of the efforts to digitize large portions of the material pursued in many (national) libraries in the last years (cf. discussion on Libraries partnering with Google). And given the amounts of data, even only the bibliographic records constitute sizable language resources in they own right.
     258\label{sec:lib-formats}
     259
     260There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition implying rich experience, and the fact that almost all of the resources in the libraries are language resources. This argument becomes even more relevant in the light of the efforts to digitize large portions of the material pursued by many (national) libraries in recent years (cf. the discussion on Libraries partnering with Google). And given the amounts of data, even the bibliographic records alone constitute sizable language resources in their own right.
    262261
    263262%\item[LoC] Library of Congress \url{http://www.loc.gov}
     
    280279A number of tools have been developed to author and process the \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html} and numerous projects (online editions, DAM systems) use METS for structuring and recording the data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html} though seems rather outdated}, among others also \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html}
    281280
    282 Metadata Object Description Schema - ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21 using  language-based tags rather than numeric ones,
     281\xne{Metadata Object Description Schema} (MODS) ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21 using language-based tags rather than numeric ones,
    283282richer than Dublin Core. It is one of the schemas endorsed to extend (be used inside) METS.
    284283
    285 In 1998 a new  Entitiy Relationship model - FRBR - Functional Requirements for Bibliographic Records  2002 \cite{FRBR1998}
    286 and since ?? RDA - Resource Description and Access
     284There have been efforts to create a conceptually more sound basis for bibliographic data: in 1998 the \xne{Functional Requirements for Bibliographic Records} (FRBR) \cite{FRBR1998} were published, an abstract model of the data expressed as an Entity Relationship model. A standard based on FRBR, \xne{Resource Description and Access} (RDA), has been proposed as a comprehensive standard for resource description and discovery, which however was confronted with opposition from the LIS community, questioning the need to abandon established cataloging practices \cite{gorman2007rda}.
     285And although there is still work on RDA, among others by the Library of Congress, there has been no wider adoption of the standard by the LIS community until now.
    287286
    288287\subsection{ESE, Europeana Data Model - EDM}
    289288
    290 Within the big european initiative \xne{Europeana} (cf. \ref{lit:digi-lib}) information about digitised objects are collected from a great number of cultural institutions from all of Europe, currently
    291 
    292 originally developed and advised the common format \xne{ESE Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation} a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. Soon it became obvious, that this format is very limiting and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model EDM\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana, haslhofer2011data,doerr2010europeana}.
    293 EDM is fully compatible with ESE, which is (and will be) accepted from the providers. There is a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} to explore the semantic data of Europeana.
     289Within the big European initiative \xne{Europeana} (cf. \ref{lit:digi-lib}), information about digitised objects is collected from a great number of cultural institutions from all over Europe, currently hosting information about 29 million objects from 2.200 institutions in 36 countries\furl{http://www.pro.europeana.eu/web/guest/content}.
     290
     291For collecting metadata from the content providers, Europeana originally developed and advocated the common format \xne{Europeana Semantic Elements} (ESE)\furl{http://pro.europeana.eu/ese-documentation}, a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. Soon it became obvious that this format is too limiting, and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model (EDM)\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana,haslhofer2011data,doerr2010europeana}.
     292EDM is fully compatible with ESE, which is (and will be) still accepted from the providers. There is already a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} to explore the Europeana data in the new format.
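A sketch of querying that endpoint with Python (SPARQLWrapper); \code{edm:ProvidedCHO} is a core EDM class, but the endpoint and its data layout may have changed, so this is an assumption-laden illustration rather than a tested recipe.

    # Sketch: count provided cultural heritage objects at the Europeana endpoint.
    # Endpoint URL taken from the text above; availability is not guaranteed.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://europeana.ontotext.com/sparql")
    sparql.setQuery("""
        PREFIX edm: <http://www.europeana.eu/schemas/edm/>
        SELECT (COUNT(?cho) AS ?n) WHERE { ?cho a edm:ProvidedCHO }
    """)
    sparql.setReturnFormat(JSON)
    result = sparql.query().convert()
    print(result["results"]["bindings"][0]["n"]["value"])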
    294293%https://github.com/europeana
    295294
     
    304303Conceptually, we want to partition these resources into two types: on the one hand, abstract concepts constituting all kinds of classifications, typologies and taxonomies; on the other hand, named entities that exist(ed) in the real world, like persons, organizations or geographical places. The main motivation for this distinction is the insight that, while for named entities there is (mostly) ``something'' in the (physical) world that gives a solid ground for equivalence relations between references from different sources (sameAs), for concepts we need to accept a plurality of existing conceptualizations, and while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach -- simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees.
    305304
    306 In the following we inventarize such resources, covering the domains expected in the dataset. (Information about size of the dataset is meant rather as a rough indication of the "general weight" of the dataset, not necessarily a precise up to date information.) The acronyms in the tables are resolved in the subsequent glossary.
     305In the following we inventory such resources (cf. tables \ref{table:data-ne}, \ref{table:data-concepts}), covering the domains expected to be needed for linking the original dataset. (Information about the size of a dataset is meant rather as a rough indication of its ``general weight'', not necessarily as precise, up-to-date information.) The acronyms in the tables are resolved in the glossary \ref{table:vocab-glossary}.
    307306How these resources will be employed is discussed in \ref{sec:values2entities}.
     307Additionally, some verbose commentary follows.
    308308
    309309%\subsubsection{Named entities}
     
    312312Other general large-scale resources are the vocabularies curated and provided by the Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}; however, there is only limited free access, full access being licensed for a fee. But recently work was announced to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}
    313313
    314 Yago is a large knowledge integrating dbpedia, geonames and ..??
    315 
    316 Regarding existing domain-specific semantic resources \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}},  the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum fÃŒr KÃŒnstliche Intelligenz, \url{http://www.dfki.de}},  is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study. \cite{Joerg2010}
     314Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum für Künstliche Intelligenz, \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study \cite{Joerg2010}.
     315
     316Also worth mentioning is \xne{Yago}, a large knowledge base created by the MPI Informatik, integrating DBpedia, Geonames and WordNet\furl{http://www.mpi-inf.mpg.de/yago-naga/yago/} \cite{Suchanek2007yago}.
    317317
    318318So we witness a strong general trend towards Semantic Web and Linked Open Data.
     
    323323
    324324%\subsection{Concepts -- Classifications, Taxonomies, \dots}
     325
     326
     327\begin{comment}
     328
     329VoID "Vocabulary of Interlinked Datasets") is an RDF based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID}
     330
     331\subsection{schema.org}
     332http://schema.org/docs/datamodel.html
     333http://www.w3.org/wiki/WebSchemas/ExternalEnumerations
     334
     335microdata or
     336http://www.w3.org/TR/rdfa-lite/
     337 Resource Description Framework in attributes
     338
     339the entire WorldCat cataloging collection made publicly
     340available using Schema.org mark-up with library extensions for use by developers and
     341search partners such as Bing, Google, Yahoo! and Yandex
     342
     343OCLC begins adding linked data to WorldCat by appending
     344Schema.org descriptive mark-up to WorldCat.org pages, thereby
     345making OCLC member library data available for use by intelligent
     346Web crawlers such as Google and Bing
     347
     348\end{comment}
     349
     350\section{Summary}
     351
     352In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology.
     353We also gave an overview of the main formats and collections in the domain of Library and Information Services and an inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications), needed as input in section \ref{sec:values2entities} about mapping values to entities.
     354
    325355
    326356
     
    345375& & \multicolumn{3}{l}{764,000 persons; 333,000 works; 192,000 organizations; 639,000 geographica } \\
    346376Yago \cite{Suchanek2007yago} & MPI Informatik & 1E7 / 1.2E8 & huge semantic KB (aggregated from Wikipedia, Wordnet, Geonames) & \href{http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html}{data dumps} \\
    347 \href{http://lt-world.de}{LT-World} & DFKI & 3.300 persons, 4.600 organizations & ontology-based portal for Language Technology & \href{http://www.lt-world.org/kb/}{portal} \\
     377\href{http://lt-world.de}{LT-World} & DFKI & 3.300 persons& ontology-based portal for LRT & \href{http://www.lt-world.org/kb/}{portal} \\
     378& & 4.600 organizations & & \\
    348379Geonames & Geonames & \textgreater 1E7 (2.8E6 / 5.5E6) & "modern" place names & data dump + web service \\
    349380PKND     & prometheus & \textgreater 37,000 & persons, artists & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{XML dump} \\
     
    389420GND/s & DNB & 202.000 & subjects (Schlagwörter), universal, lang:de & \\
    390421GTAA & NISL & 3.800 & Subjects, describing TV programs & \href{http://datahub.io/de/dataset/gemeenschappelijke-thesaurus-audiovisuele-archieven}{(RDF) data dumps}, \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\
    391 DDC & OCLC & & universal classification by field of study, translated in multiple languages & \href{http://dewey.info/}{dewey.info} \\
     422DDC & OCLC & & universal classification by field of study, multi langs & \href{http://dewey.info/}{dewey.info} \\
    392423UDC & & & & \\
    393424Wiki Categories & Wikipedia & 995,911& classification of Wiki articles as skos:Concepts & SKOS Vocabulary, SPARQL \\
    394425 DBpedia Ontology & Wikipedia & 529 / 2333 & general classification of Wiki articles as ontology & \href{http://wiki.dbpedia.org/Ontology39?v=g9b}{RDF data}, SPARQL\\
    395 ISOcat & (CLARIN) & \textgreater 6,500 & data categories defining (linguistic) concepts in a number of thematic groups (Metadata, Lexical Resources, ...) & \href{http://www.isocat.org}{web-app}, service \\
    396 Object Names Thesaurus & British Museum & &  classification of objects in the collection & \\
    397 Material Thesaurus & British Museum & & classification of material & \\
    398 Thesaurus of Monument Types & British Museum & & types of monuments & \\
     426ISOcat & (CLARIN) & \textgreater 6,500 & data categories defining (linguistic) concepts & \href{http://www.isocat.org}{web-app}, service \\
     427Object Names Thes. & British Museum & &  classification of objects in the collection & \\
     428Material Thes. & British Museum & & classification of material & \\
     429Thes. Monument Types & British Museum & & types of monuments & \\
    399430Hornbostel-Sachs-Systematik & & 300 categories & classification of musical instruments & \href{http://www.music.vt.edu/musicdictionary/texth/Hornbostel-Sachs.html}{web page} \\
    400431Oberbegriffsdatei  & DMB & & a set of vocabularies for museums, lang:de  & \url{museumsvokabular.de}, PDF, XML dumps\\
     
    408439\end{landscape}
    409440
    410 \begin{description}
    411 \item[AAT] international Architecture and Arts Thesaurus, Getty
    412 \item[CONA] Cultural Objects Name Authority
    413 \item[DAI] Deutsches ArchÀologisches Institut
    414 \item[DDC] Dewey Decimal Classification
    415 \item[DFKI] Deutsches Forschungszentrum fÃŒr KÃŒnstliche Intellligenz
    416 \item[DMB] Deutscher Museumsbund
    417 \item[DNB] Deutsche National Bibliothek
    418 \item[FAST] Faceted Application of Subject Terminology
    419 \item[Getty] Getty Research Institute curating the vocabularies\furl{http://www.getty.edu/research/tools/vocabularies/index.html}, part of Getty Trust
    420 \item[GND] \emph{Gemeinsame Norm Datei} - Integrated authority Files of the German National Library
    421 \item[GTAA] Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
    422 \begin{quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation}
    423 \item[ISO] International Standardization Organization
    424 \item[LCCN] Library of Congress Control Number
    425 \item[LCC] Library of Congress Classification
    426 \item[LCSH] Library of Congress Subject Headings
    427 \item[LoC] Library of Congress\furl{http://loc.gov}
    428 \item[OCLC] Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation
    429 \item[PKND] prometheus KÃŒnstlerNamensansetzungsDatei\furl{http://prometheus-bildarchiv.de/de/tools/pknd}
    430 \item[RKD] Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History
    431 \item[TGN] Getty Thesaurus of Geographic Names
    432 \item[UDC] Universal Decimal Classification                             
    433 \item[ULAN] Union List of Artist Names
    434 \item[VIAF] Virtual International Authority File -- union of the authority files of \textgreater 20 national (and prominent research) libraries
    435 \end{description}
    436 
    437 
    438 \begin{comment}
    439 
    440 VoID "Vocabulary of Interlinked Datasets") is an RDF based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID}
    441 
    442 \subsection{schema.org}
    443 http://schema.org/docs/datamodel.html
    444 http://www.w3.org/wiki/WebSchemas/ExternalEnumerations
    445 
    446 microdata or
    447 http://www.w3.org/TR/rdfa-lite/
    448  Resource Description Framework in attributes
    449 
    450 the entire WorldCat cataloging collection made publicly
    451 available using Schema.org mark-up with library extensions for use by developers and
    452 search partners such as Bing, Google, Yahoo! and Yandex
    453 
    454 OCLC begins adding linked data to WorldCat by appending
    455 Schema.org descriptive mark-up to WorldCat.org pages, thereby
    456 making OCLC member library data available for use by intelligent
    457 Web crawlers such as Google and Bing
    458 
    459 \end{comment}
    460 
    461 \section{Summary}
    462 
    463 In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology.
    464 We also gave an overview of main formats and collections in the domain of Library and Information Services and a inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications).
    465 
     441
     442
     443\begin{table}
     444\caption{Glossary of acronyms used in the overview of controlled vocabularies (tables \ref{table:data-ne}, \ref{table:data-concepts}) }
     445\label{table:vocab-glossary}
     446
     447%  \begin{tabu}{  >{\sffamily}l p{0.8\textwidth}
     448\begin{tabular}{ >{\sffamily}l p{0.8\textwidth}}
     449%    \hline
     450%\rowfont{\itshape\small} name & provider & size (items / facts)  & description & access \\
     451 %   \hline
     452
     453AAT & Art and Architecture Thesaurus, Getty \\
     454CONA & Cultural Objects Name Authority \\
     455DAI & Deutsches Archäologisches Institut \\
     456DDC & Dewey Decimal Classification       \\
     457DFKI & Deutsches Forschungszentrum für Künstliche Intelligenz \\
     458DMB & Deutscher Museumsbund \\
     459DNB & Deutsche Nationalbibliothek \\
     460FAST & Faceted Application of Subject Terminology \\
     461Getty & Getty Research Institute curating the \href{http://www.getty.edu/research/tools/vocabularies/index.html}{vocabularies}, part of Getty Trust \\
     462GND & \emph{Gemeinsame Normdatei} - Integrated Authority File of the German National Library \\
     463GTAA & Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives) \\
     464% {quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation} \\
     465ISO & International Organization for Standardization \\
     466LCCN & Library of Congress Control Number \\
     467LCC & Library of Congress Classification \\
     468LCSH & Library of Congress Subject Headings \\
     469LoC & Library of Congress\furl{http://loc.gov} \\
     470OCLC & Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation \\
     471PKND & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{prometheus} KünstlerNamensansetzungsDatei\\
     472RKD & Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History \\
     473TGN & Getty Thesaurus of Geographic Names \\
     474UDC & Universal Decimal Classification                            \\
     475ULAN & Union List of Artist Names \\
     476VIAF & Virtual International Authority File -- union of the authority files of \textgreater 20 national (and prominent research) libraries  \\
     477\end{tabular}
     478\end{table}
     479
  • SMC4LRT/chapters/Definitions.tex

    r3680 r3776  
    7474\end{definition}
    7575
    76 \noindent
     76\begin{example1}
    7777Example blocks, simple:
    78 \begin{example1}
    79 Short piece of sample data
    8078\end{example1}
    8179
    82 \noindent
    83 or with tabs (especially for RDF triples):
    8480\begin{example3}
    85 my:work & my:example & my:block
     81or with & tabs (especially for & RDF triples)
    8682\end{example3}
  • SMC4LRT/chapters/Design_SMCinstance.tex

    r3680 r3776  
    1212relevant parts in a triple store and do your SPARQL/reasoning on it. Well
    1313that's where I'm ultimately heading with all these registries related to
    14 semantic interoperability ... I hope ;-)\cite{Menzo2013mail}
     14semantic interoperability ... I hope ;-)
     15
     16\hfill \textit{Menzo Windhouwer} \cite{Menzo2013mail}
    1517\end{quotation}
     18
    1619
    1720As described in previous chapters (\ref{ch:infra},\ref{ch:design}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, the established machinery pertains mostly to the schema level; the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that cannot (yet) be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the constrained-field values.
     
    3841\subsection{CMD specification}
    3942
    40 The main entity of the meta model is the CMD component and is typed as specialization of the \code{owl:Class}. CMD profile is basically a CMD component with some extra features, implying a specialization relation:
     43The main entity of the meta model is the CMD component, typed as a class (an instance of \code{rdfs:Class}). A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It would be natural to translate a CMD element to an RDF property, but it needs to be a class, as a CMD element -- next to its value -- can also carry attributes. This further implies a dedicated property \code{cmds:ElementValue} to express the actual value of a given CMD element.
    4144
    4245\label{table:rdf-spec}
    4346\begin{example3}
    44 cmds:Component & subClassOf  & owl:Class. \\
    45 cmds:Profile & subClassOf  & cmds:Component. \\
    46 cmds:Element & subClassOf  & rdf:Property. \\
    47 \end{example3}
     47cmds:Component & a  & rdfs:Class. \\
     48cmds:Profile & rdfs:subClassOf  & cmds:Component. \\
     49cmds:Element & a  & rdfs:Class. \\
     50cmds:ElementValue & a & rdf:Property \\
     51cmds:Attribute & a & rdf:Property \\
     52\end{example3}
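A minimal rdflib sketch of these metamodel declarations; the URI chosen for the \code{cmds:} namespace is a placeholder, not one fixed by the text.

    # Sketch: the CMD metamodel declarations expressed with rdflib.
    # The cmds: namespace URI is a placeholder.
    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF, RDFS

    CMDS = Namespace("http://www.clarin.eu/cmd/metamodel#")  # placeholder URI

    g = Graph()
    g.bind("cmds", CMDS)
    g.add((CMDS.Component, RDF.type, RDFS.Class))
    g.add((CMDS.Profile, RDFS.subClassOf, CMDS.Component))
    g.add((CMDS.Element, RDF.type, RDFS.Class))
    g.add((CMDS.ElementValue, RDF.type, RDF.Property))
    g.add((CMDS.Attribute, RDF.type, RDF.Property))

    print(g.serialize(format="turtle"))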
     53
    4854
    4955\noindent
     
    5662 & dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\
    5763cmd:Actor       & a & cmds:Component. \\
    58 cmd:LanguageName  & a & cmds:Element. \\
    59 \end{example3}
    60 
    61 \begin{note}
    62 Should the ID assigned in the Component Registry  for the CMD entities be used as identifier in RDF, or rather the verbose name? (if yes, how to ensure uniqueness – generate the name from the cmd-path?)
    63 \end{note}
     64cmd:Actor.LanguageName  & a & cmds:Element. \\
     65\end{example3}
     66
     67%\begin{note}
     68%Should the ID assigned in the Component Registry  for the CMD entities be used as identifier in RDF, or rather the verbose name? (if yes, how to ensure uniqueness – generate the name from the cmd-path?)
     69%\end{note}
     70
    6471
    6572\subsection{Data Categories}
     
    6976dcr:datcat & a  & owl:AnnotationProperty ; \\
    7077 & rdfs:label  & "data category"@en ; \\
    71  & rdfs:comment  & "This resource is equivalent to  this data category."@en ; \\
     78 & rdfs:comment  & "This resource is equivalent to this data category."@en ; \\
    7279 & skos:note  & "The data category should be identified by its PID."@en ; \\
    7380\end{example3}
     
    8794
    8895\noindent
    89 Analogously, we could model \xne{ISOcat} data categories as data properties, i.e. metadata elements referencing ISOcat data categories could be encoded as follows:
    90 
    91 \begin{example3}
    92 <lr1> & isocat:DC-2502 & "19th century"
    93 \end{example3}
    94 
    95 \noindent
    96 However, Windhouwer\cite{Windhouwer2012_LDL} argues against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications.
    97 
    98 This raises the vice-versa question, whether to rather handle all data categories uniformly, which would mean encoding dublincore terms also as annotation properties, but the pragmatic view dictates to encode the data in line with the prevailing approach, i.e. express dublincore terms directly as data properties.
    99 
    100 
    101 \noindent
    102 The REST web service of \xne{ISOcat} provides a RDF representation of the data categories:
    103 
    104 \begin{example3}
    105 isocat:languageName & dcr:datcat & isocat:DC-2484; \\
    106  & rdfs:label & "language name"@en; \\
    107  & rdfs:comment & "A human understandable..."@en; \\
    108  & 
  \\
    109 \end{example3}
    110 
    111 However this is only meant as template, as is stated in the explanatory comment of the exported data:
    112 
    113 \begin{quotation}
    114 By default the RDF export inserts \code{dcr:datcat} annotation properties to maintain the link between the generated RDF resources and the used Data Categories. However, it is possible to also maintain a stronger semantic link when the RDF resources will be used as OWL (2) classes, properties or individuals.
    115 \end{quotation}
    116 
    117 So in a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals:
     96However, we argue against a direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications \cite{Windhouwer2012_LDL}.
     97In a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals:
    11898
    11999\begin{example3}
     
    132112
    133113\noindent
    134 By design, the relations in Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications.
    135 
    136 \begin{note}
    137 Does this mean, that I would say:
    138 \begin{example3}
    139 rel:sameAs & owl:equivalentProperty & owl:sameAs
    140 \end{example3}
    141 
    142 to enable the inference of the equivalences?
    143 
    144 Is this correct:
    145 \end{note}
    146 ?? That means, that to be able to infer that a value in a CMD element also pertains to a given data category, e.g.:
    147 
    148 \begin{example2}
    149  cmd:PublicationYear = 2012 $\rightarrow$ & dc:created = 2012
    150 \end{example2}
    151 
    152 \noindent
    153 following facts need to be present in the ontology :
    154 
    155 \begin{example3}
    156 <lr1> & cmd:PublicationYear & 2012\^{}\^{}xs:year \\
    157 cmd:PublicationYear &  owl:equivalentProperty & isocat:DC-2538 \\
    158 isocat:DC-2538 & rel:sameAs & dc:created \\
    159 rel:sameAs & owl:equivalentProperty &  owl:sameAs \\
    160 $\rightarrow$ \\
    161 <lr1> & dc:created & 2012\^{}\^{}xs:year \\
    162 \end{example3}
    163 
    164 \noindent
    165 What about other relations we may want to express? (Do we need them and if yes, where to put them? – still in RR?) Examples:
    166 
    167 \begin{example3}
    168 cmd:MDCreator   & owl:subClassOf & dcterms:Agent \\
    169 clavas:Organization & owl:subClassOf & dcterms:Agent \\
    170 <org1> & a & clavas:Organization \\
    171 \end{example3}
     114By design, the relations in the Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. The \code{rel:*} properties can be understood as an upper layer of a taxonomy of relation types, implying a subtyping:
     115
     116\begin{example3}
     117rel:sameAs & rdfs:subPropertyOf & owl:sameAs
     118\end{example3}
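To illustrate what this subtyping buys, a small rdflib sketch (the \code{rel:} and \code{isocat:} URIs are illustrative placeholders): a SPARQL property path over \code{rdfs:subPropertyOf} lets a query for \code{owl:sameAs} links pick up the Relation Registry statements as well.

    # Sketch: rel:sameAs declared as subproperty of owl:sameAs, so a query over
    # owl:sameAs and its subproperties also sees Relation Registry links.
    # The rel: and isocat: namespace URIs below are illustrative placeholders.
    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL, RDFS

    REL = Namespace("http://example.org/relcat/")
    ISOCAT = Namespace("http://www.isocat.org/datcat/")
    DCTERMS = Namespace("http://purl.org/dc/terms/")

    g = Graph()
    g.add((REL.sameAs, RDFS.subPropertyOf, OWL.sameAs))
    g.add((ISOCAT["DC-2538"], REL.sameAs, DCTERMS.created))

    query = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX owl:  <http://www.w3.org/2002/07/owl#>
        SELECT ?a ?b WHERE {
            ?p rdfs:subPropertyOf* owl:sameAs .
            ?a ?p ?b .
        }
    """
    for a, b in g.query(query):
        print(a, "can be treated as the same as", b)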
     119
     120
    172121
    173122\subsection{CMD instances}
     
    177126
    178127It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections, that are solely defined by their constituents and don't have any data on their own.) As a fall-back the PID of the MD record ( \code{<lr1.cmd>}  from \code{cmd:MdSelfLink} element) could be used as the resource identifier.
    179 If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}:
     128If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}.
     129(Note also that one MD record can describe multiple resources; this can also be easily accommodated in OpenAnnotation):
    180130
    181131\begin{example3}
    182132\_:anno1  & a & oa:Annotation; \\
    183  & oa:hasTarget  & <lr1>; \\
     133 & oa:hasTarget  & <lr1a>, <lr1b>; \\
    184134 & oa:hasBody  & <lr1.cmd>; \\
    185135 & oa:motivatedBy  & oa:describing \\
     
    192142\begin{example3}
    193143<lr1.cmd> & dcterms:identifier  & <lr1.cmd>;  \\
    194  & dcterms:creator ??  & "\var{\{cmd:MdCreator\}}";  \\
    195  & dcterms:publisher  & <http://clarin.eu>, <provider-oai-accesspoint>; ?? \\
    196  & dcterms:created /dcterms:modified? & "\var{\{cmd:MdCreated\}}" ?? \\
     144 & dcterms:creator & "\var{\{cmd:MdCreator\}}";  \\
     145 & dcterms:publisher  & <http://clarin.eu>\\
     146 & dcterms:created /dcterms:modified? & "\var{\{cmd:MdCreated\}}" \\
    197147\end{example3}
    198148
     
    207157& ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
    208158\end{example3}
    209 
    210 \noindent
    211 ?? Should both collection hierarchy and resource-pointers (collection and resource MD records) be encoded as ore:Aggregation?
    212 Additionally the flat header field \code{cmd:MdCollectionDisplayName} has been introduced to indicate by simple means the collection, of which given resource is part.
    213 This information can be used to generate a separate one-level grouping of the resources, in which the value from the \code{cmd:MdCollectionDisplayName} element would be used as the label of an otherwise undefined \code{ore:ResourceMap}.
    214 Even the identifier/ URI for this collections is not clear. Although this collections should match with the ResourceProxy hierarchy, there is no guarantee for this, thus a 1:1 mapping cannot be expected.
    215 
    216 \todocode{check consistency for MdCollectionDisplayName vs. IsPartOf in the instance data}
    217 
    218 \begin{example3}
    219 \_:mdcoll  & a   & ore:ResourceMap; \\
    220  & rdfs:label & "Collection 1"; \\
    221 \_:mdcoll\#aggreg & a   & ore:Aggregation \\
    222  & ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
    223 \end{example3}
    224159       
    225160\subsubsection{Components – nested structures}
    226161
    227 There are two variants to express the tree structure of the CMD records, i.e. the containment relation between the components:
    228 
    229 \begin{enumerate}[a)]
    230 \item the components are encoded as object property
    231 
    232 \begin{example3}
    233 <lr1>  & cmd:Actor  & \_:Actor1 \\
    234 <lr1>  & cmd:Actor  & \_:Actor2 \\
    235 \_:Actor1  & cmd:motherTongue  & iso-639:aac \\
    236 \_:Actor2  & cmd:motherTongue  & iso-639:deu \\
    237 \_:Actor1  & cmd:role & "Interviewer" \\
    238 \_:Actor2 & cmd:role & "Speaker" \\
    239 \end{example3}
    240 
    241 \item a dedicated object property is used
     162For expressing the tree structure of the CMD records, i.e. the containment relation between the components, a dedicated property \code{cmd:contains} is used:
    242163
    243164\begin{example3}
     
    246167\end{example3}
    247168
    248 \end{enumerate}
    249 
    250169\subsection{Elements, Fields, Values}
    251170Finally, we want to integrate also the actual field values in the CMD records into the ontology.
    252171
    253 \subsubsection{Predicates}
    254 As explained before CMD elements are typed as \code{rdf:Property} with the corresponding data category expressed as annotation property:
     172% \subsubsection{Predicates}
     173As explained before, CMD elements have to be typed as \code{rdfs:Class}, with the actual value expressed via the \code{cmds:ElementValue} property and the corresponding data category expressed as an annotation property.
     174
     175The following example shows the whole chain of statements from the metamodel down to a literal value:
    255176
    256177\begin{example3}
    257178cmd:timeCoverage  & a   & cmds:Element \\
     179cmd:timeCoverageValue & a & cmds:ElementValue \\
    258180cmd:timeCoverage  & dcr:datcat  & isocat:DC-2502 \\
    259 <lr1>  & cmd:timeCoverage  & "19th century" \\
    260 
    261 \end{example3}
    262 
    263 \subsubsection{Literal values -- data properties}
    264 
    265 To generate triples with literal values is straightforward:
    266 
    267 \begin{definition}{Literal triples}
    268 lr:Resource \ \quad cmds:Property \ \quad xsd:string
    269 \end{definition}
    270 
    271 \begin{example3}
    272 <lr1> & cmd:Organisation & "MPI" \\
    273 \end{example3}
    274 
    275 \subsubsection{Mapping to entities -- object properties}
    276 
    277 The more challenging but also more valuable aspect is to generate objectProperty triples with the literal values mapped to semantic entities:
    278 
    279 \begin{definition}{new RDF triples}
    280 lr:Resource \ \quad cmd:Property \ \quad xsd:anyURI
    281 \end{definition}
    282 
    283 \begin{example3}
    284 <lr1> & cmd:Organisation\_? & <org1> \\
    285 \end{example3}
    286 
    287 \begin{note}
     181<lr1> & cmd:contains & \_:timeCoverage1 \\
     182\_:timeCoverage1 & a & cmd:timeCoverage \\
     183\_:timeCoverage1 & cmd:timeCoverageValue & "19th century" \\
     184\end{example3}
     185
     186
     187While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples with the literal values mapped to semantic entities:
     188
     189\begin{example3}
     190\var{cmds:Element} & \var{cmds:ElementValue\_?} & \var{xsd:anyURI}\\
     191\_:organisation1 & cmd:OrganisationValue\_? & <org1> \\
     192\end{example3}
     193
     194\begin{comment}
    288195Don't we need a separate property (predicate) for the triples with object properties pointing to entities,
    289196i.e. \code{cmd:Organisation\_} additionally to \code{cmd:Organisation}
    290 \end{note}
    291 
    292 The mapping process is detailed in \ref{sec:values2entities}
    293 
    294 %%%%%%%%%%%%%%%%%55
     197\end{comment}
     198
     199The mapping process is detailed in \ref{sec:values2entities}.
     200
     201
     202
     203%%%%%%%%%%%%%%%%%
    295204\section{Mapping field values to semantic entities}
    296205\label{sec:values2entities}
     
    310219We don't try to achieve complete ontology alignment, we just want to find
    311220for our ``anonymous'' concepts semantically equivalent concepts from other ontologies.
    312 This is very near just other phrasing for the definition of ontology mapping function as given by \cite{EhrigSure2004, amrouch2012survey}:
     221This is almost equivalent to the definition of ontology mapping function as given by \cite{EhrigSure2004, amrouch2012survey}:
    313222``for each concept (node) in ontology A [tries to] find a corresponding concept
    314223(node), which has the same or similar semantics, in ontology B and vice versa''.
    315224
    316225The first two points in the above enumeration represent the steps necessary to be able to apply the ontology mapping.
    317 The identification of appropriate vocabularies is discussed in the next subsection. In the operationalization, the identified vocabularies could be treated as one aggregated ontology to map all entities against. For the sake of higher precision, it may be sensible to perform the task separately for individual concepts, i.e. organisations, persons etc. and in every run consider only relevant vocabularies.
    318 
    319 
    320 The transformation of the data has been partly described in previous section:
    321 It can be trivially automatically converted into RDF triples as :
    322 
    323 \begin{example3}
    324 <lr1> & cmd:Organisation & "MPI" \\
    325 \end{example3}
    326 
    327 However for the needs of the mapping task we propose to reduce and rewrite to retrieve distinct concept , value pairs:
    328 
    329 \begin{example3}
    330 \_:1 & a & cmd:Organisation;\\
     226The identification of appropriate vocabularies is discussed in the next subsection. In the operationalization, the identified vocabularies could be treated as one aggregated semantic resource to map all entities against. For the sake of higher precision, it may be sensible to perform the task separately for individual concepts, i.e. organisations, persons etc. and in every run consider only relevant vocabularies.
     227
     228The transformation of the data has been partly described in the previous section. It can be trivially and automatically converted into RDF triples as:
     229
     230\begin{example3}
     231\_:organisation1 & cmd:OrganisationValue & "MPI" \\
     232\end{example3}
     233
     234However, for the needs of the mapping task, we propose to reduce and rewrite the data to retrieve distinct (concept, value) pairs (cf. figure \ref{fig:smc_cmd2lod}):
     235
     236\begin{example3}
     237\_:1 & a & clavas:Organisation;\\
    331238   & skos:altLabel & "MPI";
    332239\end{example3}
     
    345252\subsubsection{Identify vocabularies}
    346253
    347 \todoin{Identify related ontologies, vocabularies? - see DARIAH:CV}
    348 LT-World \cite{Joerg2010}
    349 
    350 One generic way to indicate vocabularies for given metadata fields or data categories being discussed in the CMD community is to use dedicated annotation property (tentatively \code{@clavas:vocabulary}) in the schema or data category definition. For such a mechanism to work, the consuming applications (like metadata editor) need to be made aware of this convention and interpret it accordingly.
     254One generic way to indicate vocabularies for given metadata fields or data categories, currently being discussed in the CMD community, is to use a dedicated annotation property in the schema or data category definition (tentatively labeled \code{@clavas:vocabulary}). For such a mechanism to work, the consuming applications (like the metadata editor) need to be made aware of this convention and interpret it accordingly.
    351255
    352256The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS} – a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS, and other relevant vocabularies shall also be ingested into this system, so that we can assume OpenSKOS as a first source of vocabularies. However, definitely not all of the existing reference data will be hosted by OpenSKOS, so in general we have to assume/consider a number of different sources (cf. \ref{refdata}).
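A hedged sketch of the basic lookup this enables: matching a CMD field value against the SKOS labels of such a vocabulary loaded into rdflib; the vocabulary file name is a placeholder, and a production setup would rather query OpenSKOS or the aggregated store.

    # Sketch: naive matching of a field value against a SKOS vocabulary.
    # "organisations.skos.rdf" is a placeholder for e.g. an OpenSKOS export.
    from rdflib import Graph
    from rdflib.namespace import SKOS

    g = Graph()
    g.parse("organisations.skos.rdf")

    def lookup(value):
        """Concepts whose prefLabel or altLabel matches the normalized value."""
        needle = value.strip().lower()
        hits = set()
        for predicate in (SKOS.prefLabel, SKOS.altLabel):
            for concept, label in g.subject_objects(predicate):
                if str(label).strip().lower() == needle:
                    hits.add(concept)
        return hits

    print(lookup("MPI"))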
     
    380284\end{definition}
    381285
    382 In the implementation, there needs to be additional initial configuration input, identifying datasets for given data categories,
      286In the implementation, an additional initial configuration input is needed, identifying the datasets for given data categories,
    383287which will be the result of the previous step.
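
A minimal sketch of such a configuration is given below, simply as a mapping from data category identifiers to vocabulary sources; the PIDs and URLs are placeholders, not actual values.

\lstset{language=Python}
\begin{lstlisting}[label=lst:sketch-vocab-config, caption={Sketch of the configuration mapping data categories to vocabularies}]
# Illustrative configuration sketch: which vocabulary (dataset) to
# consult for which data category. PIDs and URLs are placeholders.
DATCAT_VOCABULARIES = {
    "http://www.isocat.org/datcat/DC-0000": [      # placeholder PID
        "http://example.org/clavas/organisations"  # placeholder vocabulary
    ],
    "http://www.isocat.org/datcat/DC-0001": [      # placeholder PID
        "http://example.org/vocab/resource-types"  # placeholder vocabulary
    ],
}

def vocabularies_for(datcat_pid):
    """Return the configured vocabulary sources for a data category."""
    return DATCAT_VOCABULARIES.get(datcat_pid, [])
\end{lstlisting}
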
    384288
     
    409313\label{sec:lod}
    410314
    411 
    412 With the new enhanced dataset, as detailed in section \ref{sec:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset.
    413 
    414 Namely to enhance it by employing ontological resources.
    415 Mainly this enhancement shall mean, that the user can access the data indirectly by browsing one or multiple ontologies, with which the data will then be linked. These could be for example ontologies of Organizations and Projects.
    416 
    417 
    418 SPARQL
    419 
    420 rechercheisidore, dbpedia, ...
    421 
    422 
    423 \cite{Europeana RDF Store Report}
    424 
    425 Technical aspects (RDF-store?): Virtuoso
    426 
    427 
    428 semantic search component in the Linked Media Framework
    429 
    430 \todoin{check SARQ}\furl{http://github.com/castagna/SARQ}
    431 
    432 
    433 %\section {Full semantic search - concept-based + ontology-driven ?}
    434 %\label{semantic-search}
    435 
     315With the new enhanced dataset, as detailed in section \ref{sec:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility of exploring the dataset using external semantic resources.
      316The user can access the data indirectly by browsing external vocabularies and taxonomies with which the data will be linked, for example vocabularies of organizations or taxonomies of resource types.
     317
      318The technical base for a semantic web application is usually an RDF triple store, as discussed in \ref{semweb-tech}.
      319Given that our main concern is the data itself, its processing and display, we want to rely on a stable, robust, feature-rich solution that minimizes the effort of providing the data online. The most promising candidate seems to be \xne{Virtuoso}, an integrated hybrid data store able to deal with different types of data (``Universal Data Store'').
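
Once the data is loaded into such a store, client applications can query it via its SPARQL endpoint. The following sketch shows such a query; the endpoint URL (the default local Virtuoso address) and the query pattern are assumptions for illustration only.

\lstset{language=Python}
\begin{lstlisting}[label=lst:sketch-sparql, caption={Sketch of a SPARQL query against the triple store}]
# Illustrative sketch: querying the triple store via its SPARQL endpoint.
# The endpoint URL and the query pattern are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://localhost:8890/sparql")
endpoint.setQuery("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?resource ?label WHERE {
        ?resource ?property ?entity .
        ?entity skos:altLabel ?label .
        FILTER (CONTAINS(LCASE(STR(?label)), "mpi"))
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["resource"]["value"], row["label"]["value"])
\end{lstlisting}
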
     320
     321
      322Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data via dereferenceable URIs, in practice it is mostly necessary, for performance reasons, to pool linked datasets from different sources that shall be queried together into one data store. This implies that the data to be kept by the data store will be considerably larger than ``just'' the original dataset.
    436323
    437324\section{Summary}
    438325
    439 %The task can be also seen as building bridge between XML resources and semantic resources expressed in RDF, OWL.
    440 
    441 The process of expressing the whole of the data as one semantic resource, can be also understood as schema or ontology merging task. Data categories being the primary mapping elements
    442 
    443 
    444 In this chapter, an expression of the whole of the CMD data domain into RDF was proposed, with special focus on the way how to translate the string values in metadata fields to corresponding semantic entities. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration.
    445 
     326In this chapter, an expression of the whole of the CMD data domain into RDF was proposed, with special focus on the method to translate the string values in metadata fields to corresponding semantic entities.
      327This task can also be seen as building a bridge between the world of XML resources and that of semantic resources expressed in RDF.
     328Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration.
     329
     330%The process of expressing the whole of the data as one semantic resource, can be also understood as schema or ontology merging task. Data categories being the primary mapping elements
  • SMC4LRT/chapters/Design_SMCschema.tex

    r3680 r3776  
    1212The SMC module is part of the CMD Infrastructure. It is a consumer of data from the production-side registries and serves search services on the exploitation side of the infrastructure, as well as third party applications accessing the joint CLARIN metadata domain.
    1313
    14 \begin{figure*}[!ht]
     14\begin{figure*}
    1515\includegraphics[width=0.8\textwidth]{images/SMC_modules.png}
    1616\caption{The component view on the SMC - modules and their inter-dependencies}
     
    4545
    4646\subsection{smcIndex}\label{def:smcIndex}
    47 In this section, we describe \code{smcIndex} -- the data type to denote indexes used by the components of the system internally, as well as input and output on the interfaces.
    48 
    49 An \code{smcIndex} is a human-readable string adhering to a specific syntax, denoting some search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix is derived from the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism), e.g. \concept{dc.title} and b) on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms, that may not contain whitespaces.
      47In this section, we describe \var{smcIndex} -- the data type used to denote indexes, both internally by the components of the system and as input and output on the interfaces.
     48
      49An \var{smcIndex} is a human-readable string adhering to a specific syntax, denoting a search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix, derived from the way indices are referenced in CQL syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism, cf. \ref{cql}), e.g. \concept{dc.title}, and b) the dot-notation used in the IMDI browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms that may not contain whitespace.
    5050
    5151\begin{defcap}
    52 \caption{Grammar of \code{smcIndex}}
     52\caption{Grammar of \var{smcIndex}}
    5353\begin{align*}
    5454smcIndex &::= dcrIndex \ | \ cmdIndex  \\
     
    6767\end{defcap}
    6868
    69 The grammar distinguishes two main types of \code{smcIndex}: a) \code{dcrIndex} referring to data categories and b) \code{cmdIndex} denoting a specific ``CMD entity'', i.e. an element (metadata field), component or whole profile defined within CMD (cf. \ref{def:CMD} for description of the CMD data model).
    70 These two types of \code{smcIndex} follow different construction patterns.
    71 \code{cmdIndex} has a recursive path-like structure and can be interpreted as a XPath-expression into the instances of CMD profiles. In contrast to it, \code{dcrIndex} consists of just one-level term and is generally not directly applicable on existing data. It can be understood as abstract index referring to well-defined concepts -- the data categories -- and for actual search it needs to be resolved to the set of CMD elements it is referred by. In return, one can expect to match more metadata fields from multiple profiles, all referring to the same data category.
    72 
    73 It is important to note, that in general -- by design -- \code{smcIndex} can be ambiguous, meaning it can refer to multiple concepts, or CMD entities. This is due to the fact that the labels of the data categories and CMD entities are not guaranteed unique.
     69The grammar distinguishes two main types of \var{smcIndex}: a) \var{dcrIndex} referring to data categories and b) \var{cmdIndex} denoting a specific ``CMD entity'', i.e. an element (metadata field), component or whole profile defined within CMD (cf. \ref{def:CMD} for description of the CMD data model).
     70These two types of \var{smcIndex} follow different construction patterns.
      71\var{cmdIndex} has a recursive path-like structure and can be interpreted as an XPath expression into the instances of CMD profiles. In contrast, \var{dcrIndex} consists of just a one-level term and is generally not directly applicable to existing data. It can be understood as an abstract index referring to well-defined concepts -- the data categories -- and for actual search it needs to be resolved to the set of CMD elements that refer to it. In return, one can expect to match more metadata fields from multiple profiles, all referring to the same data category.
     72
      73It is important to note that in general an \var{smcIndex} can be ambiguous, meaning it can refer to multiple concepts or CMD entities. This is due to the fact that the labels of the data categories and CMD entities are not guaranteed to be unique.
     7474Although it may seem problematic and undesirable to have an ambiguous reference, this is an intentional design decision. The labels are needed for human-readability and ambiguity can be useful, as long as one is aware of it.
     7575However, there also needs to be the possibility to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows referencing indexes by the corresponding identifier. Following are some explanations of the individual constituents of the grammar:
    7676
    77 \code{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular, the \xne{dublincore} set of metadata terms. \code{datcatLabel} is the human-readable name of given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However despite its name, it is not guaranteed unique. Therefore, \code{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist, the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.
    78 
    79 \code{profile} is reference to a CMD profile. Again, dealing with the ambiguity, it can be either the name of the profile \code{profileName} or its identifier \code{profileId} as issued by the Component Registry (e.g. \code{clarin.eu:cr1:p\_1272022528363} for \concept{LexicalResourceProfile}). Even if a profile is referenced by its identifier it may and should be prefixed by its name to still ensure human-readability. Or, seen the other way round, the name is disambiguated by suffixing it with the identifier:
      77\var{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular the \xne{dublincore} set of metadata terms. \var{datcatLabel} is the human-readable name of a given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However, despite its name, it is not guaranteed to be unique. Therefore, \var{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist; the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.
     78
      79\var{profile} is a reference to a CMD profile. Again, it can be either the name of the profile \var{profileName} or -- for a guaranteed unambiguous reference -- its identifier \var{profileId} as issued by the Component Registry (e.g. \var{clarin.eu:cr1:p\_1272022528363} for \concept{LexicalResourceProfile}). Even if a profile is referenced by its identifier, it may and should be prefixed by its name to still ensure human-readability. Or, seen the other way round, the name is disambiguated by suffixing it with the identifier:
    8080
    8181\begin{example1}
     
    8484\end{example1}
    8585
    86 \noindent
    87 \code{dotPath} allows to address a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}) within a metadata description. This allows to easily express search in whole components, instead of having to list all individual fields. The paths don't need to start from the root entity (the profile), they can reference any subtree structure. However longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still this mechanism does not guarantee unique references, it only allows to narrow down the ambiguity.
     86%\noindent
      87\var{dotPath} allows addressing a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}), within a metadata description. This makes it easy to express search over whole components, instead of having to list all individual fields. The paths do not need to start from the root entity (the profile); they can reference any subtree structure. However, longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still, this mechanism does not guarantee unique references; it only allows specifying context and thus narrowing down the semantic ambiguity.
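
To make the distinction between the two index types more tangible, the following sketch classifies a given index string roughly along the lines of the grammar above; it is a simplification covering only the common cases, not a full implementation of the grammar.

\lstset{language=Python}
\begin{lstlisting}[label=lst:sketch-smcindex, caption={Simplified classification of an \var{smcIndex} string}]
# Simplified sketch: classify an smcIndex string as dcrIndex or cmdIndex.
# It covers only the common cases, not every production of the grammar.
import re

DCR_PREFIXES = ("isocat", "dc")  # known data category registries

def classify_smc_index(index):
    """Return ("dcrIndex", registry, term) or ("cmdIndex", dotPath)."""
    prefix, _, rest = index.partition(".")
    if prefix in DCR_PREFIXES and rest:
        return ("dcrIndex", prefix, rest)
    if re.fullmatch(r"[A-Za-z_][\w-]*(\.[A-Za-z_][\w-]*)*", index):
        return ("cmdIndex", index)
    raise ValueError("not a valid smcIndex: " + index)

print(classify_smc_index("isocat.resourceTitle"))
print(classify_smc_index("Session.Location.Country"))
\end{lstlisting}
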
    8888
    8989\subsection{Terms}
    9090\label{datamodel-terms}
    9191
    92 In abstract terms, the internal format is basically a table of indexes with information collected from the upstream registries or created during preprocessing. Main entity is \code{Term} that represents either a label of a data category, or a CMD entity (a CMD  component or element). Further entities \code{Termset} and \code{Concept} are mainly used for logical grouping of the \code{Terms}. In the following, we explain the data model of these entities and their use in more detail. For a full \xne{Terms.xsd} XML schema see listing \ref{lst:terms-schema}.
      92Here we describe the XML schema for the internal representation of the processed data.
      93In abstract terms, the internal format is basically a table with information about indexes collected from the upstream registries or created during preprocessing. \code{Term} is the main entity; it represents either a label of a data category or a CMD entity (a CMD component or element). \code{Termset} represents a logical collection of \code{Terms} (one profile or the data categories of one type). \code{Concept} represents a data category and groups all corresponding terms. \code{Relation} is used to express a relation between two \code{Concepts}. In the following, we explain the data model of these entities and their use in more detail. For the full \xne{Terms.xsd} XML schema see listing \ref{lst:terms-schema}.
    9394
    9495\subsubsection{Type \code{Term}}
     
     9697\code{Term} is a polymorphic data type that can have different sets of attributes depending on the type of data it represents.
    9798
    98 \begin{table}[ht]
     99\begin{table}[h]
    99100\caption{Attributes of \code{Term} when encoding data category}
    100101\label{table:terms-attributes-datcat}
    101  \begin{tabular}{ l | l | l }
    102   attribute & allowed values & sample value\\
     102 \begin{tabu}{ p{0.1\textwidth} p{0.4\textwidth} >{\footnotesize}X }
     103\hline
     104\rowfont{\itshape\small}   attribute & allowed values & sample value\\
    103105\hline
    104106  \var{concept-id} &  PID given by DCR  & \code{isocat:DC-2522} \\
     
    106108  \var{type} &  one of ['id', 'label', 'mnemonic'] & \code{id}, \code{label}\\
    107109 \var{xml:lang} & two-letter language code (only for ISOcat) & \code{en}, \code{si} \\
    108  \end{tabular}
     110\hline
     111 \end{tabu}
    109112\end{table}
    110113
    111 %\captionsetup{justification=raggedright, singlelinecheck=false}
    112 \lstset{language=XML}
    113 \begin{lstlisting}[label=lst:terms-attributes-datcat, caption=sample \code{Term} element encoding an ISOcat data category]
    114 <Term concept-id="http://www.isocat.org/datcat/DC-2544" set="isocat"
    115         type="label" xml:lang="fr">nom de ressource</Term>
    116 \end{lstlisting}
    117 
    118 \begin{table}[ht]
     114\begin{table}[h]
    119115\caption{Attributes of \code{Term} when encoding CMD entity}
    120116\label{table:terms-attributes-cmd}
    121 \begin{tabularx}{1\textwidth}{ l | X | X }
    122  %\begin{tabu}{1\textwidth}{ l | l | l }
    123   attribute & allowed values & sample value\\
     117 \begin{tabu}{ p{0.1\textwidth}  p{0.4\textwidth} >{\footnotesize}X }
     118\hline
     119\rowfont{\itshape\small}   attribute & allowed values & sample value\\
    124120\hline
    125121  \var{id} &  \var{cmdEntityId} as defined in \ref{def:smcIndex} & \code{clarin.eu:cr1:c\_1290431694487\#Url} \\
    126   \var{type} &  one of ['CMD\_Element', 'CMD\_Component'] & \code{CMD\_Element}\\
     122  \var{type} & {\footnotesize \code{CMD\_Element} | \code{CMD\_Component} } & \code{CMD\_Element}\\
     123  \var{datcat} &  reference to the data category, URL or \var{dcrIndex} & \code{isocat:DC-2546}\\
    127124  \var{name} & name of the component or element & \code{Url} \\
    128125  \var{path} &  \var{dotPath} (cf. \ref{def:smcIndex}) & \code{SpeechCorpus.Access.Contact.Url} \\
    129126  \var{parent} & name of the parent component &  \code{Contact} \\
    130  \end{tabularx}
     127\hline
     128 \end{tabu}
    131129\end{table}
    132130
    133 \lstset{language=XML}
    134 \begin{lstlisting}[label=lst:terms-attributes-element, caption=sample \code{Term} element encoding a CMD element]
    135 <Term type="CMD_Element" name="Url" datcat="http://www.isocat.org/datcat/DC-2546"
    136           id="clarin.eu:cr1:c_1290431694487#Url" parent="Contact"
    137           path="SpeechCorpus.Access.Contact.Url"/>
    138 \end{lstlisting}
    139 
    140 \begin{table}[ht]
    141 \caption{Attributes of \code{Term} when encoding a term in the inverted index?}
     131\begin{table}
     132\caption{Attributes of \code{Term} when encoding a CMD entity in the inverted index}
    142133\label{table:terms-attributes-index}
    143  \begin{tabularx}{1\textwidth}{ l | X | X }
    144   attribute & allowed values & sample value\\
     134 \begin{tabu}{ p{0.1\textwidth}  p{0.4\textwidth} >{\footnotesize}X }
     135\hline
     136\rowfont{\itshape\small}   attribute & allowed values & sample value\\
    145137\hline
    146138  \var{id} &  \var{cmdEntityId} cf. \ref{def:smcIndex} & \code{clarin.eu:cr1:c\_1359626292113 \#ResourceTitle} \\
    147   \var{type} &  one of \code{['id', 'mnemonic', 'label', 'full-path']} & \code{full-path}\\
      139 \var{set} & identifier of the containing termset & \code{cmd} \\
     140  \var{type} &  one of \code{full-path} or \code{min-path} & \code{full-path}\\
    148141  \var{schema}  & \var{profileID} & \code{clarin.eu:cr1:p\_1357720977520} \\ 
    149   \var{concept-id} & id of the corresponding (data category) &  \var{isocat:}\code{DC-2545} \\
     142%  \var{concept-id} & id of the corresponding (data category) &  \var{isocat:}\code{DC-2545} \\
    150143  \var{node-value} &  \var{dotPath} & \code{SpeechCorpus.Access.Contact.Url} \\
    151  \end{tabularx}
     144\hline
     145 \end{tabu}
    152146\end{table}
    153147
     148%\captionsetup{justification=raggedright, singlelinecheck=false}
     149\lstset{language=XML}
     150\begin{lstlisting}[label=lst:terms-attributes-datcat, caption=sample \code{Term} element encoding an ISOcat data category]
     151  <Term concept-id="http://www.isocat.org/datcat/DC-2544" set="isocat"
     152             type="label" xml:lang="fr">nom de ressource</Term>
     153\end{lstlisting}
     154
     155\lstset{language=XML}
     156\begin{lstlisting}[label=lst:terms-attributes-element, caption=sample \code{Term} element encoding a CMD element]
     157  <Term type="CMD_Element" name="Url" id="clarin.eu:cr1:c_1290431694487#Url"
     158             parent="Contact" datcat="http://www.isocat.org/datcat/DC-2546"
     159             path="SpeechCorpus.Access.Contact.Url"/>
     160\end{lstlisting}
     161
    154162\lstset{language=XML}
    155163\begin{lstlisting}[label=lst:terms-attributes-index, caption=sample \code{Term} element encoding a term in the inverted index]
    156    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
    157                 id="clarin.eu:cr1:c_1359626292113#ResourceTitle"
    158                 concept-id="http://www.isocat.org/datcat/DC-2545" >
     164  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
     165             id="clarin.eu:cr1:c_1359626292113#ResourceTitle"
     166             concept-id="http://www.isocat.org/datcat/DC-2545" >
    159167        AnnotatedCorpusProfile.GeneralInfo.ResourceTitle
    160    </Term>
     168  </Term>
    161169\end{lstlisting}
    162170
    163171
    164172\subsubsection{Type \code{Concept}}
    165 \code{Concept} represents a data category. Identifier is the PID issued by the DCR.
     173\code{Concept} represents a data category. Identifier is the PID issued by the DCR encoded in the \var{id} attribute.
    166174It groups all terms belonging to given data category.
    167175The content model is a sequence of \code{Terms} followed by a sequence of \code{info} elements.
    168 Initially, after loading from DCR, a \code{Concept} contains only \code{Term}s of type: \code{id, mnemonic, label} encoding the corresponding attributes of the data category, followed by \code{info} elements holding the definition potentially in different languages:
     176Initially, after loading from DCR, a \code{Concept} contains only \code{Term}s of type: \code{id, mnemonic, label} (in multiple languages) encoding the corresponding attributes of the data category, followed by \code{info} elements holding the definition (also potentially in different languages). In the inverted index, the \code{Concept} is enriched with the \code{Terms} representing corresponding CMD entities (cf. Listing \ref{lst:dcr-cmd-map}).
     177
    169178
    170179\lstset{language=XML}
    171180\begin{lstlisting}[label=lst:concept, caption=sample \code{Concept} element representing the data category \concept{resourceTitle}]
    172 <Concept xmlns:dcif="http://www.isocat.org/ns/dcif" type="datcat"
    173                id="http://www.isocat.org/datcat/DC-2545">
    174          <Term set="isocat" type="mnemonic">resourceTitle</Term>
    175          <Term set="isocat" type="id">DC-2545</Term>
    176          <Term set="isocat" type="label" xml:lang="en">resource title</Term>
    177          <Term set="isocat" type="label" xml:lang="fi">resurssin otsikko</Term>
     181  <Concept id="http://www.isocat.org/datcat/DC-2545" type="datcat">
     182    <Term set="isocat" type="mnemonic">resourceTitle</Term>
     183    <Term set="isocat" type="id">DC-2545</Term>
     184    <Term set="isocat" type="label" xml:lang="en">resource title</Term>
     185    <Term set="isocat" type="label" xml:lang="fi">resurssin otsikko</Term>
     186    ...
     187    <info xml:lang="en">The title is the complete title
     188                of the resource without any abbreviations.</info>
     189     ...
     190  </Concept>
     191\end{lstlisting}
     192
     193%\lstset{language=XML}
     194%\begin{lstlisting}[label=lst:concept-cmd-term, caption=\code{Term} for CMD element added to %\code{Concept}]
     195% <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1345561703620"
     196%            id="clarin.eu:cr1:c_1345561703619#Name">collection.CollectionInfo.Name</Term>
     197%\end{lstlisting}
     198
     199\lstset{language=XML}
     200\begin{lstlisting}[label=lst:dcr-cmd-map, caption=Sample of the inverted index \code{Concept} $\mapsto$ \code{Term}]
     201  <Concept id="http://www.isocat.org/datcat/DC-2545" type="datcat">
     202    <Term set="isocat" type="mnemonic">resourceTitle</Term>
     203    <Term set="isocat" type="id">DC-2545</Term>
     204    <Term set="isocat" type="label" xml:lang="en">resource title</Term>
     205    <Term set="isocat" type="label" xml:lang="hr">naslov resursa</Term>
     206    <Term set="isocat" type="label" xml:lang="lv">resursa nosaukums</Term>
     207      ...
     208    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
     209            id="clarin.eu:cr1:c_1359626292113#ResourceTitle">
     210        AnnotatedCorpusProfile.GeneralInfo.ResourceTitle</Term>
     211    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1297242111880"
     212            id="clarin.eu:cr1:c_1271859438123#Title">
     213        AnnotationTool.GeneralInfo.Title</Term>
     214    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
     215            id="clarin.eu:cr1:c_1271859438201#Title">
     216        Session.Title</Term>
    178217        ...
    179          <info xml:lang="en">The title is the complete title
    180                         of the resource without any abbreviations.</info>
    181         ...
    182 </Concept>
    183 \end{lstlisting}
    184 
    185 In the inverted index the \code{Concept} is enriched with the \code{Terms} representing corresponding CMD entities (cf. Listing \ref{lst:concept-cmd-term}).
    186 
    187 \lstset{language=XML}
    188 \begin{lstlisting}[label=lst:concept-cmd-term, caption=\code{Term} for CMD element added to \code{Concept}]
    189  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1345561703620"
    190             id="clarin.eu:cr1:c_1345561703619#Name">collection.CollectionInfo.Name</Term>
    191 \end{lstlisting}
    192 
    193 \lstset{language=XML}
    194 \begin{lstlisting}[label=lst:dcr-cmd-map, caption=Sample of the inverted index \code{Concept} $\mapsto$ \code{Term}]
    195     <Concept id="http://www.isocat.org/datcat/DC-2545" type="datcat">
    196         <Term set="isocat" type="mnemonic">resourceTitle</Term>
    197         <Term set="isocat" type="id">DC-2545</Term>
    198         <Term set="isocat" type="label" xml:lang="en">resource title</Term>
    199         <Term set="isocat" type="label" xml:lang="hr">naslov resursa</Term>
    200         <Term set="isocat" type="label" xml:lang="lv">resursa nosaukums</Term>
    201         ...
    202         <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
    203                 id="clarin.eu:cr1:c_1359626292113#ResourceTitle">
    204                         AnnotatedCorpusProfile.GeneralInfo.ResourceTitle</Term>
    205         <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1297242111880"
    206                 id="clarin.eu:cr1:c_1271859438123#Title">
    207                         AnnotationTool.GeneralInfo.Title</Term>
    208         <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1274880881885"
    209                 id="clarin.eu:cr1:c_1274880881884#Title">
    210                         imdi-corpus.Corpus.Title</Term>
    211         <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
    212                 id="clarin.eu:cr1:c_1271859438201#Title">
    213                         Session.Title</Term>
    214         ...
    215     </Concept>
    216 \end{lstlisting}
    217 
     218  </Concept>
     219\end{lstlisting}
     220%    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1274880881885"
     221  %          id="clarin.eu:cr1:c_1274880881884#Title">
     222     %   imdi-corpus.Corpus.Title</Term>
     223
     224\subsubsection{Type \code{Relation}}
      225As explained in \ref{def:rr}, the framework allows expressing relations between concepts or data categories. These are maintained in the Relation Registry and fetched from there by SMC upon initialization. Type \code{Relation} is the internal representation of this information. It has the attribute \var{type} indicating the type of the relation as delivered by RR (currently only \code{sameAs}). The relations of one relation set are enclosed in a \code{Termset} element carrying the identifier of the relation set. The content of \code{Relation} is a sequence of at least two \code{Concepts}. Currently, it is always exactly two \code{Concepts}, corresponding to the pairs delivered from RR, but by traversing the equivalence relation, concept clusters (or ``cliques'') containing more than two equivalent concepts could be generated.
     226
     227% role="about"
      228\begin{lstlisting}[label=lst:relation, caption=Internal representation of the relation between concepts]
     229  <Relation type="sameAs">
     230    <Concept type="datcat" id="http://www.isocat.org/datcat/DC-2484"/>
     231    <Concept type="datcat" id="http://purl.org/dc/elements/1.1/language"/>
     232  </Relation>
     233\end{lstlisting}
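
The clique building mentioned here could, for instance, be realized as a simple connected-components computation over the delivered pairs; the sketch below reuses the concept identifiers from the listing above.

\lstset{language=Python}
\begin{lstlisting}[label=lst:sketch-cliques, caption={Sketch: building concept cliques from sameAs pairs}]
# Sketch: collapse pairwise sameAs relations into equivalence cliques
# by computing connected components over the concept identifiers.
from collections import defaultdict

def build_cliques(pairs):
    """Group concept identifiers related by (transitive) sameAs pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, cliques = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        cliques.append(component)
    return cliques

pairs = [
    ("http://www.isocat.org/datcat/DC-2484",
     "http://purl.org/dc/elements/1.1/language"),
]
print(build_cliques(pairs))
\end{lstlisting}
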
    218234
    219235\subsubsection{Type \code{Termsets/Termset}}
    220 \code{Termset} groups a set of terms as outlined in \ref{table:cx-list-params}. It is identified by the \code{@set} attribute.
    221 For example all french labels of isocat data categories under the identifier \code{isocat-fr} build a termset, as well as all the full-paths of one profile.
    222 
    223 Finally, \code{Termsets} is a root element grouping \code{Termset} elements.
     236\code{Termset} groups a set of terms. (Possible termsets are listed in table \ref{table:cx-list-params}.) It is identified by the \code{@set} attribute.
      237For example, all French labels of ISOcat data categories under the identifier \code{isocat-fr} form a termset, as do all the full paths of one profile. The content of the \code{Termset} can optionally begin with an \code{info} element (conveying information as provided by the source registry, like definition, creation date or author), followed by a flat or nested list of \code{Term} elements. Finally, \code{Termsets} is the root element grouping \code{Termset} elements.
    224238
    225239\lstset{language=XML}
    226240\begin{lstlisting}[label=lst:termset, caption=\code{Termset} element representing a CMD profile]
    227 <Termset name="AnnotatedCorpusProfile" id="clarin.eu:cr1:p_1357720977520"
     241  <Termset name="AnnotatedCorpusProfile" id="clarin.eu:cr1:p_1357720977520"
    228242            type="CMD_Profile">
    229       <info>
    230          <id>clarin.eu:cr1:p_1357720977520</id>
    231          <description>A CMDI profile for annotated text corpus resources.</description>
    232          <name>AnnotatedCorpusProfile</name>
    233          <registrationDate>2013-01-31T11:57:12+00:00</registrationDate>
    234          <creatorName>nalida</creatorName>
    235           ...
    236      </info>
    237      <Term type="CMD_Component" name="GeneralInfo" datcat=""
     243    <info>
     244      <id>clarin.eu:cr1:p_1357720977520</id>
     245      <description>A CMDI profile for annotated text corpus resources.
     246      </description>
     247      <name>AnnotatedCorpusProfile</name>
     248      <registrationDate>2013-01-31T11:57:12+00:00</registrationDate>
     249      <creatorName>nalida</creatorName>
     250      ...
     251   </info>
     252   <Term type="CMD_Component" name="GeneralInfo" datcat=""
    238253            id="clarin.eu:cr1:c_1359626292113"     
    239254            parent="AnnotatedCorpusProfile"
    240255            path="AnnotatedCorpusProfile.GeneralInfo">
    241             <Term ...
     256       <Term ...
    242257     </Term>
    243258     ...
    244 </Termset>
    245 \end{lstlisting}
    246 
    247 The content of the \code{Termset} can optionally begin with an \code{info} element (conveying information as provided by the source registry, like definition, creation date or author) followed by a flat or nested list of \code{Term} elements.
    248 
     259  </Termset>
     260\end{lstlisting}
    249261
    250262%%%%%%%%%%%%%%%%%%%%%%
     
    255267Consequently, the infrastructure has also foreseen this dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to find \textbf{corresponding fields in different metadata schemas}.
    256268
    257 The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks} between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications building the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{sec:qx}).
    258 
    259 The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts, that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm, but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points), instead of a collection of pair-wise links between fields.
      269The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks}, between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications and represent the basis for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{sec:qx}).
     270
      271The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm (cf. \ref{lit:schema-matching}); rather, the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points) instead of a collection of pair-wise links between fields.
    260272
    261273\subsection{Interface Specification}
     
    264276In this section, we define the abstract interface of the proposed service, in terms of the input parameters and output data format.
    265277
    266 \todoin{The two interfaces list and map
    267 Full definition in appendix and under link!}
    268 
    269 \subsubsection*{Method \code{list}}
    270 
    271 Method \code{list} lists available items for given context or type. This allows the client applications to configure the query input  and provide autocompletion functionality.
    272 
    273 \begin{definition}{URI-pattern of the \code{list} method}
     278%\todoin{The two interfaces list and map Full definition in appendix and under link!}
     279
     280\subsubsection*{Method \var{list}}
     281
      282Method \var{list} lists the available items for a given context or type. This allows the client applications to configure the query input and provide autocompletion functionality. Table \ref{table:cx-list-params} lists the accepted values for the \var{\$context} parameter and the corresponding types of returned data.
     283
     284\begin{definition}{URI-pattern of the \var{list} method}\label{def:list-method}
    274285/smc/cx/list/\$context
    275286\end{definition}
    276 
    277 \noindent
    278 Table \ref{table:cx-list-params} lists the allowed values for the \var{\$context} parameter and the corresponding types of returned data
    279287
    280288\begin{table}
    281289\caption{Allowed values for parameters of the \code{list}-method and corresponding return values}
    282290\label{table:cx-list-params}
    283  \begin{tabular}{ l | p{0.7\textwidth} }
    284   \var{\$context}  & returns a list of \\
    285  \hline
     291% \begin{tabular}{ l | p{0.7\textwidth} }
     292%  \var{\$context}  & returns a list of \\
     293 \begin{tabu}{ l p{0.7\textwidth} }
     294\hline
     295\rowfont{\itshape\small} \$context & returns a list of \\
     296\hline
    286297  \code{*,top} & available termsets \\
    287298  \var{\{termset\}} & terms (CMD components and elements) of given termset \\
     
    292303  \code{cmd-full-paths} & all complete (starting from Profile) \emph{dotPaths} to CMD components and elements\\
    293304  \code{cmd-minimal-paths} & reduced but still unique paths to CMD components and elements \\
    294   \code{relsets} & available relation sets (defined in the Relation Registry)
    295  \end{tabular}
     305  \code{relsets} & available relation sets (defined in the Relation Registry) \\
     306\hline
     307\end{tabu}
    296308\end{table}
    297309
    298  Also the application should deliver additional information about the indexes like description and a link to the definition of the underlying entity in the source registry.
      310\subsubsection*{Method \var{explain}}
      311The service also has to deliver additional information about the indexes, such as a description and a link to the definition of the underlying entity in the source registry.
     312
     313\begin{definition}{URI-pattern of the \code{explain} method}\label{def:explain-method}
     314/smc/cx/explain/\{\$context\} \ [ \ /\{\$term\} \ ] \ [ \ ?format=\$format \ ] \ [ \ ?lang=\$lang \ ]
     315\end{definition}
     316
     317\begin{example1}
     318/smc/cx/explain/cmd/clarin.eu:cr1:p\_1357720977520 \\
     319/smc/cx/explain/isocat/DC-2506?lang=et,pt
     320\end{example1}
     321
     322\lstset{extendedchars=false,
     323escapeinside='', language=XML}
     324\begin{lstlisting}[label=lst:sample-explain, caption=Sample output of the \var{explain} function for a data category]
     325  <Concept type="datcat" id="http://www.isocat.org/datcat/DC-2506">
     326    <Term set="isocat" type="mnemonic">annotationMode</Term>
     327    <Term set="isocat" type="id">DC-2506</Term>
      328    <Term set="isocat" type="label" xml:lang="et">m'\"{a}'rgendusviis</Term>
     329    <Term set="isocat" type="label" xml:lang="pt">modo de anota'çã'o</Term>
      330    <info xml:lang="et">N'\"{a}'itab, kas ressurss m'\"{a}'rgendati
      331                                  k'\"{a}'sitsi v'\~{o}'i automaatselt.</info>
     332    <info xml:lang="pt">Indica se o recurso foi criado manualmente
     333                                  ou por processo autom'á'tico.</info>
     334</Concept>
     335\end{lstlisting}
     336
    299337%NO (this will be handled by the servic as multililngual labels e) : or the Name-attribute (in any available translation, e.g. \texttt{numero di telefono@it}) of the data category.}
    300338% While it is desirable to also allow the Name-attribute of the data category (\texttt{telephone number}), especially also the Names defined in other working languages (\texttt{numero di telefono@it, numer telefonu@pl}), special care has to be taken here as these attributes mostly contain white spaces, which could cause problems in downstream components, when parsing a complex query containing such indices.
    301339
    302340
    303 \subsubsection*{Method \code{map} }
    304 
    305 Method \code{map} performs the actual translations:
     341\subsubsection*{Method \var{map} }
     342
     343Method \var{map} performs the actual translations:
    306344it accepts any index (adhering to the \var{smcIndex} datatype, cf. \ref{def:smcIndex}) and returns a list of corresponding indexes.
    307345%it returns list of equivalent terms/smcIndexes for a given term/smcIndex.
    308346
    309 \begin{definition}{General function definition}
    310 smcIndex \mapsto smcIndex[ ]
     347\begin{definition}{General function definition}\label{def:map-method-general}
     348smcIndex \mapsto smcIndex*
    311349\end{definition}
    312350
    313 \begin{definition}{URI-pattern of the \code{map} method}
     351\begin{definition}{URI-pattern of the \var{map} method}
    314352/smc/cx/map/\{\$context\}/\{\$term\} \ [ \ ?format=\{\$format\} \ ] \ [ \ \&relset=\{\$relset\} \ ]
    315353\end{definition}
    316354
    317355\noindent
    318 Parameter definition:\\*
     356Parameter definition:
    319357\begin{description}
    320 \item[\var{\$context}] identifies the context to search in for the \var{\$term}, primarily this would be one of \code{[*, isocat, dc, cmd]}, in extended mode any of terms listed in table \ref{table:cx-list-params} is accepted
      358\item[\var{\$context}] identifies the context in which to search for the \var{\$term}; primarily this is one of \code{[*, isocat, dc, cmd]}, in extended mode any of the terms listed in table \ref{table:cx-list-params} is accepted
     321359\item[\var{\$term}] \var{smcIndex} term (without the context prefix); the term is used to look up a concept and deliver the list of equivalent indexes; case-insensitive
     322360\item[\var{\$format}] the desired result format can be indicated explicitly, as an alternative to the default content negotiation; one of \code{[json, rdf, xml]}; \code{xml} is the default
    323 \item[\var{\$relset}] optional; reference to a relset to be applied on the identified concept to expand the cluster of equivalent ; allows multiple values from \code{list/relsets}; if multiple sets are they are all applied in the expansion
     361\item[\var{\$relset}] optional; reference to a relation set to be combined with the identified concept to expand the cluster of matching concepts; allows multiple values from \code{list/relsets}; if multiple sets are listed they are all applied in the expansion
    324362\end{description}
    325363
     
    327365Possible return formats:
    328366\begin{description}
    329 \item[\var{'', default}] internal XML format with all attributes (\xne{Terms.xsd}, cf. listing \ref{lst:map-output})
    330 
    331 
     367\item[\var{default}] internal XML format with all attributes (\xne{Terms.xsd}, cf. listing \ref{lst:map-output})
    332368\item[\var{schema}] distinct schemas (\code{Termset}) referencing given data category or string
    333369\lstset{language=XML}
     
    335371<Termset schema="clarin.eu:cr1:p_1295178776924" name="serviceDescription"/>
    336372\end{lstlisting}
    337 \item[\var{datcat}] distinct data categories (\code{Term@id@da}) by \code{@concept-id}
     373\item[\var{datcat}] distinct data categories, by grouping the \code{Term@datcat} attribute of the matching terms 
    338374\lstset{language=XML}
    339375\begin{lstlisting}
     
    341377           set="isocat" type="datcat">creatorFullName</Term>
    342378\end{lstlisting}
    343 \item[\var{cmdid, id}] distinct cmd entities (\code{Term}) by \code{@id}
     379\item[\var{cmdid, id}] distinct cmd entities grouped by \code{@id}
    344380\begin{lstlisting}
    345381<Term type="CMD_Element" name="Name" elem="Name" parent="Session"
     
    350386\end{description}
    351387
    352 \begin{table}[ht]
    353 \caption{Sample values for parameters of the \code{map}-method and corresponding return values}
    354 \label{table:cx-map-params}
    355 
    356  \begin{tabular}{ l  l | l}
    357   \var{\$context}  & \var{\$term} & returns \\
    358  \hline
    359   \code{*} & \code{name} & ? \\
    360   \code{isocat} & \code{resourceTitle} & CMD terms \\
    361   \code{cmd} & \code{name} & \\
    362 
    363  \end{tabular}
    364 \end{table}
    365 
    366388\noindent
    367389Sample request\\*
     
    371393\lstset{language=XML}
    372394\begin{lstlisting}[label=lst:map-output, caption=Corresponding sample output ]
    373 <Terms >
     395<Termset>
    374396    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1297242111880"
    375         id="clarin.eu:cr1:c_1271859438123#Title">
    376                 AnnotationTool.GeneralInfo.Title</Term>
     397                id="clarin.eu:cr1:c_1271859438123#Title">
     398            AnnotationTool.GeneralInfo.Title</Term>
    377399    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1288172614014"
    378         id="clarin.eu:cr1:c_1288172614011#resourceTitle">
    379                 BamdesLexicalResource.BamdesCommonFields.resourceTitle
     400                id="clarin.eu:cr1:c_1288172614011#resourceTitle">
     401            BamdesLexicalResource.BamdesCommonFields.resourceTitle
    380402     </Term>
    381403   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1274880881885"
    382         id="clarin.eu:cr1:c_1274880881884#Title">
    383                 imdi-corpus.Corpus.Title</Term>
     404                id="clarin.eu:cr1:c_1274880881884#Title">
     405            imdi-corpus.Corpus.Title</Term>
    384406   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
    385         id="clarin.eu:cr1:c_1271859438201#Title">
    386                 Session.Title</Term>
     407                id="clarin.eu:cr1:c_1271859438201#Title">
     408            Session.Title</Term>
    387409   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1272022528363"
    388         id="clarin.eu:cr1:c_1271859438123#Title">
    389                 LexicalResourceProfile.LexicalResource.GeneralInfo.Title</Term>
     410                id="clarin.eu:cr1:c_1271859438123#Title">
     411            LexicalResourceProfile.LexicalResource.GeneralInfo.Title</Term>
    390412    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1284723009187"
    391         id="clarin.eu:cr1:c_1271859438123#Title">collection.GeneralInfo.Title</Term>
     413                id="clarin.eu:cr1:c_1271859438123#Title">
     414            collection.GeneralInfo.Title</Term>
    392415\end{lstlisting}
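
From a client application, such a request might be issued as follows; the service base URL is an assumption and has to be replaced by the address of the actual deployment.

\lstset{language=Python}
\begin{lstlisting}[label=lst:sketch-map-client, caption={Sketch of a client call to the \var{map} method}]
# Sketch of a client call to the map method; the base URL is an
# assumption and has to be replaced by the actual service address.
import xml.etree.ElementTree as ET
import urllib.request

BASE_URL = "http://localhost:8080/smc/cx"  # assumed deployment address

def map_index(context, term, fmt="xml"):
    url = "%s/map/%s/%s?format=%s" % (BASE_URL, context, term, fmt)
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

# Expand a data category reference into the corresponding CMD paths.
result = map_index("isocat", "resourceTitle")
for term in ET.fromstring(result).iter("Term"):
    print(term.get("schema"), (term.text or "").strip())
\end{lstlisting}
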
    393416
     
    420443
    421444\noindent
    422 (3) \emph{container data categories} -- further expansions will be possible once the \emph{container data categories} \cite{SchuurmanWindhouwer2011} will be used.\footnote{Although metadata modellers are encouraged to indicate data categories for both components and element, this is taking up only slowly and currently only around 14 per cent of the components have a data category specified.} The idea is to set a concept link also for the components, meaning that besides the ``atomic'' data category for \concept{actorName}, there would be also a data category for the complex concept \concept{Actor}.
      445(3) \emph{container data categories} -- further expansions will be possible once the \emph{container data categories} \cite{SchuurmanWindhouwer2011} are used.\footnote{Although metadata modellers are encouraged to indicate data categories for both components and elements, this is being taken up only slowly and currently only around 14 per cent of the components have a data category specified.} The idea is to set a concept link also for the components, meaning that besides the ``atomic'' data category for \concept{actorName}, there would also be a data category for the complex concept \concept{Actor}.
    423446Having concept links also on components will require a compositional approach for the mapping function, resulting in:
    424447\begin{example2}
     
    429452\subsection{Implementation}
    430453
    431 The core functionality  of the SMC is implemented as a set of XSL-stylesheets
    432 
    433454At the core of the described module is a set of XSL-stylesheets, governed by an ant-build file and a configuration file holding the information about individual source registries.
    434 
    435 \todoin{generate and reference XSLT-documentation}
     455The documentation of the XSLT stylesheets and the build process is found in appendix \ref{sec:smc-xsl-docs}.
    436456
     437457The service is implemented as a RESTful service; however, it only supports the GET operation, as it operates on a data set that the users cannot change directly. (The changes have to be performed in the upstream registries.)
     
    440460\subsubsection{Initialization}
    441461\label{smc_init}
    442 During initialization the application fetches the information from the source modules (cf. \ref{def:CMDI}) and transforms it into the internal \xne{Terms} format (cf. \ref{datamodel-terms}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories:
    443 
    444 \begin{definition}{Principal structure of the inverted index}
    445 datcatURI \mapsto profile.component.element[]
      462During initialization the application fetches the information from the source modules (cf. \ref{def:CMDI}) and transforms it into the internal \xne{Terms} format (cf. \ref{datamodel-terms}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories (cf. \ref{def:inverted-index}).
     463
     464\begin{definition}{Principal structure of the inverted index}\label{def:inverted-index}
     465datcatPID \mapsto profile.component.element*
    446466\end{definition}
    447467
    448468The collected data categories are enriched with information from corresponding registries (DCRs), adding the label, the description and available translations into other working languages. %, usable as base for multi-lingual search user-interface.
    449 
    450469Finally, relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories.
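
A much reduced sketch of the construction of the inverted map is given below; the input is a minimal stand-in for the internal \xne{Terms} representation (cf. \ref{datamodel-terms}), using the attributes \code{@datcat} and \code{@path} shown earlier in this chapter.

\lstset{language=Python}
\begin{lstlisting}[label=lst:sketch-inverted-index, caption={Sketch: building the inverted map from \code{Term} elements}]
# Sketch: build the inverted map datcatPID -> [dotPath, ...] from a
# flat list of Term elements carrying @datcat and @path attributes.
import xml.etree.ElementTree as ET
from collections import defaultdict

TERMS = """
<Termset name="SpeechCorpus" type="CMD_Profile">
  <Term type="CMD_Element" name="Url"
        datcat="http://www.isocat.org/datcat/DC-2546"
        path="SpeechCorpus.Access.Contact.Url"/>
  <Term type="CMD_Element" name="Description"
        datcat="" path="SpeechCorpus.GeneralInfo.Description"/>
</Termset>
"""

def build_inverted_map(termset_xml):
    inverted = defaultdict(list)
    for term in ET.fromstring(termset_xml).iter("Term"):
        datcat = term.get("datcat")
        if datcat:  # only terms with a concept link contribute
            inverted[datcat].append(term.get("path"))
    return inverted

print(dict(build_inverted_map(TERMS)))
\end{lstlisting}
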
    451470
    452 \begin{figure*}[!ht]
     471\begin{figure*}
    453472\includegraphics[width=1\textwidth]{images/smc_init.png}
    454473\caption{The various stages of the data flow during the initialization}
     
    461480\item[\xne{cmd-terms}] a flat list of \code{Term} elements representing all components and elements in all known profiles; grouped in \code{Termset} elements representing the profiles
    462481\item[\xne{cmd-terms-nested}] as above, however the \code{Term} elements are nested reflecting the component structure in the profile
    463 \item[\xne{dcr-terms}] a list of \code{Concept} elements representing the data categories with nested \code{Term} elements
      482\item[\xne{dcr-terms}] a list of \code{Concept} elements representing the data categories, with nested \code{Term} elements encoding their properties (\code{id}, \code{label}, ...)
    464483\item[\xne{dcr-cmd-map}] the main inverted index -- a list of concepts as in \xne{dcr-terms}, but with additional \code{Term} elements included in the \code{Concept} elements representing the CMD components or elements corresponding to given data category (cf. listing \ref{lst:dcr-cmd-map})
     465484\item[\xne{rr-terms}] additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing a pair of related data categories are wrapped in a \code{Relation} element (with a \code{@type} attribute)
     
    467486
    468487\subsubsection{Operation}
    469 For the actual service operation a minimal application has been implemented, that accesses the cached internal datasets and optionally applies XSL-stylesheets for post-processing depending on requested format.
    470 The application implements the interface as defined in \ref{def:cx-interface} as a XQuery module based on the \xne{restxq}-library within a \xne{eXist} XML-database.
      488For the actual service operation a minimal application has been implemented that accesses the cached internal datasets and optionally applies XSL stylesheets for post-processing, depending on the requested format.
      489The application implements the interface as defined in \ref{def:cx-interface} as an XQuery module based on the \xne{restxq} library within an \xne{eXist} XML database.
    471490
    472491\subsection{Extensions}
     
     474493Once there are overlapping\footnote{i.e. different relations may be defined for one data category in different relation sets} user-defined relation sets in the Relation Registry, an additional input parameter will be required to \emph{explicitly restrict the selection of relation sets} to apply in the mapping function.
    475494
    476 Also, use of \emph{other than equivalency} relations will necessitate more complex logic in the query expansion and accordingly also more complex response of the crosswalk service, either returning the relation types themselves as well or equip the list of indexes with some kind of similarity ratio.
      495Also, the use of \emph{other than equivalence} relations will necessitate more complex logic in the query expansion and accordingly also a more complex response of the crosswalk service, either returning the relation types themselves as well or equipping the list of indexes with some kind of similarity ratio.
    477496
    478497\section{qx -- concept-based search}
    479498\label{sec:qx}
    480499To recall, the main goal of this work is to enhance the search capabilities of the search engines serving the metadata.
    481 In this section we want to explore, how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
     500In this section we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
    482501
     484503The emphasis lies on the query language and the corresponding query input interface. A crucial aspect is the question of how to integrate the additional processing, i.e. how to deal with the even greater amount of information in a user-friendly way without overwhelming the user, while still being verbose on demand about the applied processing, so that the user can understand how the result came about and, even more importantly, can manipulate the processing easily.
    484503
    485 Note, that \emph{query expansion} yet needs to distinguished from \emph{query translation}, a task to express input query in another query language (e.g. CQL query expressed as XPath).
    486 
    487 Note, also that this chapter deals only with the schema-level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The corresponding instance level is tackled in \ref{semantic-search}.
      504Note that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The instance level is dealt with in \ref{semantic-search}.
      505
      506Note also that \emph{query expansion} needs to be distinguished from \emph{query translation}, the task of expressing an input query in another query language (e.g. a CQL query expressed as XPath).
    488507
    489508\subsection{Query language}
     509\label{cql}
      510As the base query language to build upon, the \emph{Context Query Language} (CQL) is used, a well-established standard designed with extensibility in mind.
      511CQL is the query language defined as part of \xne{SRU/CQL} -- the communication protocol introduced by the Library of Congress. SRU is a simplified, XML- and HTTP-based successor to Z39.50 \cite{Lynch1991}, which is very widespread in the library networks.
      512It was introduced in 2002 \cite{Morgan04}. The maintenance of SRU/CQL has been
      513transferred from LoC to OASIS in 2012, and OASIS released a first version of the protocol as a Committee Specification in April 2012 \cite{OASIS2012sru}.
     514
     515Coming from the libraries world, the protocol has a certain bias in favor of bibliographic metadata.
     516However, the protocol is defined in a very generic way, with a strong focus on extensibility.
     517It is equally suitable for content search.
     518\begin{comment}
     519The protocol part (SRU) defines three major operations:
     5201) \emph{explain}: in which the target repository announces its particular configuration (e.g. available indices),
     5212) \emph{scan}:  informing about terms available in/for given index, and
     5223) \emph{searchRetrieve}: returning a search result based on a CQL query.
     523\end{comment}
     524
      525The query language part (CQL -- Context Query Language) defines a relatively complex and complete query language.
      526Its decisive feature is its inherent extensibility, allowing custom indexes and operators to be defined.
      527In particular, CQL introduces so-called \emph{context sets} -- a kind of application profile that allows new indexes or even comparison operators to be defined in separate namespaces. This feature can be employed to integrate the dynamic indexes adhering to the \var{smcIndex} syntax as proposed in \ref{def:smcIndex}.
     528
      529The SRU/CQL protocol has also been adopted by the CLARIN community as the basis for a protocol for federated content search\furl{http://clarin.eu/fcs} (FCS) \cite{stehouwer2012fcs}, which is another argument for using this protocol for metadata search as well, given the inherent interrelation between metadata and content search.
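
To illustrate how such context-set qualified indexes could be handled on the implementation side, the following small Python sketch splits an index of the form \code{prefix.name} into its context set prefix and the index name proper. The \code{cmd} prefix and the index names are hypothetical examples, not the actual \var{smcIndex} syntax defined in \ref{def:smcIndex}.

\begin{lstlisting}[language=Python]
# Sketch: split a context-set qualified index into prefix and index name.
# The "cmd" prefix and the index names are hypothetical examples.
KNOWN_CONTEXT_SETS = {"cql", "dc", "cmd"}

def parse_index(qualified: str):
    """Return (context_set, index) for an index like 'cmd.languageName'."""
    prefix, sep, name = qualified.partition(".")
    if sep == "" or prefix not in KNOWN_CONTEXT_SETS:
        # no (known) prefix: treat the whole string as a plain index name
        return None, qualified
    return prefix, name

print(parse_index("cmd.languageName"))  # ('cmd', 'languageName')
print(parse_index("title"))             # (None, 'title')
\end{lstlisting}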
    491530
    492531\subsection{Query Expansion}
     
    501540\end{example1}
    502541
    503 \noindent
     542%\begin{note}
    504543As an alternative to the -- potentially costly -- on-the-fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories, in which the values from all metadata fields annotated with the given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}).
     544%\end{note}
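
A minimal sketch of this indexing-time alternative, assuming a simple mapping from metadata fields to the data categories they are annotated with (all field names and data category identifiers below are purely illustrative):

\begin{lstlisting}[language=Python]
# Sketch: add "virtual" per-data-category fields to a record at indexing time.
# FIELD2DATCAT stands in for the information derived from the CMD profiles.
FIELD2DATCAT = {
    "Corpus.Language.Name": "datcat:languageName",
    "teiHeader.language": "datcat:languageName",
    "teiHeader.author": "datcat:creator",
}

def add_virtual_fields(record: dict) -> dict:
    """Return the record enriched with one field per referenced data category."""
    enriched = dict(record)
    for field, value in record.items():
        datcat = FIELD2DATCAT.get(field)
        if datcat:
            enriched.setdefault(datcat, []).append(value)
    return enriched

print(add_virtual_fields({"teiHeader.language": "German"}))
# {'teiHeader.language': 'German', 'datcat:languageName': ['German']}
\end{lstlisting}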
    505545
    506546\subsection{SMC as module for Metadata Repository}
     
    508548As a concrete proof of concept, the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI, which provides all the metadata records harvested within the CLARIN joint metadata domain (cf. \ref{cmdi_exploitation}).
    509549
    510 Metadata repository is implemented in xquery running within the eXist XML-database as a web application.
    511 
    512 There is also a XQuery implementation, that is integrated as a module of the SADE/cr-xq - eXist-based web application framework for publishing resources, on which the Metadata Repository is running.
    513 
    514 
    515 \begin{figure*}[!ht]
      550The Metadata Repository itself is implemented as a custom project within \xne{cr-xq}, a generic web application developed in XQuery, running within the eXist XML database. \xne{cr-xq} is developed by the author as part of a larger publication framework, \xne{corpus\_shell}. As can be seen in figure \ref{fig:modules-mdrepo}, within \xne{cr-xq} the crosswalk service -- implemented as the \xne{smc-xq} module -- is used by the search module \xne{fcs}, which is in turn used by the \xne{query\_input} module, which provides a user interface widget for formulating the query.
     551
     552\begin{figure*}
     553\begin{center}
    516554\includegraphics[width=0.8\textwidth]{images/modules_mdrepo-smc.png}
    517555\caption{The component view on the SMC - modules and their inter-dependencies}
    518556\label{fig:modules-mdrepo}
     557\end{center}
    519558\end{figure*}
    520559
     
    522561\subsection{User Interface}
    523562
    524 A starting point for our considerations is the traditional structure found in many (advanced) search interface, which is basically a an array of index - term pairs, or in more advanced alternatives: tuples of index, comparison operator, term and boolean operator:
      563A starting point for our considerations is the traditional structure found in many (``advanced'') search interfaces, which is basically an array of tuples of index, comparison operator and term, combined by boolean operators. This is reflected in the CQL syntax with the basic \var{searchClause} and the boolean operators for formulating more complex queries.
    525564\begin{definition}{Generic data format for structured queries}
    526  [ < index, operation, term, boolean > ]
     565 < index, operation, term, boolean >+
    527566\end{definition}
    528567
    529 \noindent
    530 This maps trivially to the main clause of the CQL syntax, the \var{searchClause} \ref{def:searchClause}.
    531568% {Basic clause of the CQL syntax}
    532 \begin{definition}{The main clause of the CQL syntax, the \code{searchClause}}
     569\begin{definition}{The basic \code{searchClause} of the CQL syntax}
    533570\label{def:searchClause}
    534571searchClause \ ::= \ index \ relation \ searchTerm
    535572\end{definition}
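
The correspondence between the generic tuple structure above and CQL can be made concrete with a small sketch that serializes a list of such tuples into a query string (a simplified rendering that ignores escaping and operator precedence; the index names are invented):

\begin{lstlisting}[language=Python]
# Sketch: serialize <index, operation, term, boolean> tuples into a CQL query.
def to_cql(rows):
    query = ""
    for index, operation, term, boolean in rows:
        clause = '%s %s "%s"' % (index, operation, term)
        query = clause if not query else "%s %s %s" % (query, boolean, clause)
    return query

rows = [
    ("cmd.languageName", "=", "German", None),  # first row needs no boolean
    ("cmd.creator", "=", "Grimm", "and"),
]
print(to_cql(rows))  # cmd.languageName = "German" and cmd.creator = "Grimm"
\end{lstlisting}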
    536573
    537 \noindent
    538 An alternative would be a smart parsing input field with contextual autocomplete. Though such a widget would still share the underlying data model.
    539 
    540 \begin{figure*}[!ht]
     574\begin{figure*}
     575\begin{center}
    541576\includegraphics[width=0.8\textwidth]{images/query_input_autocomplete_term.png}
    542577\caption{A proposed query input interface offering concepts as search indexes}
    543578\label{fig:query_input}
     579\end{center}
    544580\end{figure*}
    545581
    546582\noindent
    547583Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions.
    548 
    549 A fundementally different approach is the "content first" paradigm, that, similiar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. The difference is, that the suggestions are typed, so that the user is informed from which index given term comes (\concept{person}, \concept{place}, etc.)
    550 
    551 Although we concentrate on query input, the use of indexes has to be consistent across, be it in labeling the fields of the results, or when providing facets to drill down the search.
    552 
    553 
     584Although we concentrate on query input, the use of indexes has to be consistent across the user interface, be it in labeling the fields of the results, or when providing facets to drill down the search.
     585
     586A fundamentally different approach is the "content first" paradigm, that, similiar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. The difference is, that the suggestions are typed, so that the user is informed from which index given term comes (\concept{person}, \concept{place}, etc.)
     587
      588Combining the two approaches, we could arrive at a ``smart'' widget: an input field with on-the-fly query parsing and contextual autocomplete. Even such a widget would, however, still share the underlying data model of \xne{CQL} in combination with \var{smcIndexes}.
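
A typed-suggestion lookup of the kind described above can be sketched in a few lines: each candidate term is stored together with the index it originates from, so that the widget can display both (the indexes and terms are invented sample data):

\begin{lstlisting}[language=Python]
# Sketch: "content first" autocompletion returning typed suggestions.
TERMS = {
    "person": ["Grimm, Jacob", "Grimm, Wilhelm"],
    "place": ["Vienna", "Villach"],
}

def suggest(prefix: str, limit: int = 10):
    """Return (term, index) pairs whose term starts with the typed prefix."""
    prefix = prefix.lower()
    hits = [(term, index) for index, terms in TERMS.items()
            for term in terms if term.lower().startswith(prefix)]
    return sorted(hits)[:limit]

print(suggest("vi"))  # [('Vienna', 'place'), ('Villach', 'place')]
\end{lstlisting}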
     589
     590
     591%%%%%%%%%%%%%%%%%%%%%%%%%%
    554592\section{SMC Browser}
    555593\label{smc-browser}
     
    597635\includegraphics[width=1\textwidth]{images/smc-browser_UIsketch.png}
    598636\end{center}
    599 \caption{A sketch of a possible layout for the SMC Browser -- individual parts of the user interface}
     637\caption{A sketch of a possible layout for the SMC Browser -- individual parts of the user interface and the update dependencies}
    600638\label{fig:smc-browser_sketch}
    601639\end{figure*}
     
    604642Prospective parts of the application layout (cf. figure \ref{fig:smc-browser_sketch}):
    605643\begin{description}
    606 \item[index panel] list of all available nodes (profiles, components, elements, data categories); allows to select nodes to be displayed in the graph pane
     644\item[index pane] list of all available nodes (profiles, components, elements, data categories); allows to select nodes to be displayed in the graph pane
    607645\item[main graph pane] displays the selected subgraph, needs as much space as possible
    608646\item[graph navigation bar] for manipulation of the displayed graph by various means
    609647\item[detail view] displaying definition and statistical information for selected nodes
    610648\item[statistics] a separate view on the data listing the statistical information for whole dataset in tables
     649\item[notifications] a widget to provide feedback about the system status to the user
    611650\end{description}
    612651
     
    634673\item[profiles + datcats + datcats + groups + rr]
    635674        as above but again with profile-groups and relations
    636 \item[only profiles]
     675\item[profiles similarity]
    637676       just profiles with links between them representing the degree of similarity based on the reuse of components and data categories
    638677\end{description}
     
    692731
    693732%%%%%%%%%%%%%%%%%%%%%%%%%
    694 \section{Application of Schema Matching techniques in SMC}
     733\section{Application of \emph{schema matching} techniques in SMC}
    695734\label{sec:schema-matching-app}
    696735
    697736Even though the described module is about ``semantic mapping'',  until now  we did not directly make use of the traditional ontology/schema mapping/alignment methods and tools as summarized in \ref{lit:schema-matching}. This is due
    698 to the fact that the in this work we can harness the mechanisms of the semantic interoperability layer built into the core of the CMD Infrastructure, which integrates the task of identifying semantic correspondences directly into the process of schema creation,
     737to the fact that in this work we can harness the mechanisms of the semantic interoperability layer built into the core of the CMD Infrastructure, which integrates the task of identifying semantic correspondences directly into the process of schema creation,
    699738to a high degree obviating the need for complex a posteriori schema matching/mapping techniques.
    700 Or put in terms of the schema matching methodology, the system relies on explicitely set concept equivalences as base for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.
      739Or, put in terms of the schema matching methodology, the system relies on explicitly asserted concept equivalences as the basis for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.
    701740
    702741However, this only holds for schemas already created within the CMD framework (and even for these only to a certain degree, as will be explained later). Given the growing universe of definitions (data categories and components) in the CMD framework, the metadata modeller could very well profit from applying schema mapping techniques as a pre-processing step in the task of integrating existing external schemas into the infrastructure. (User involvement is identified by \cite{shvaiko2012ontology} as one of the promising future challenges to ontology matching.) Already now, we witness a growing proliferation of components in the Component Registry and of data categories in the Data Category Registry.
    703742
    704743Let us restate the problem of integrating existing external schemas as an application of the \var{schema matching} method:
    705 The data modeller starts off with existing schema \var{$S_{x}$}. The system accomodates a set of schemas\footnote{We talk of schema even though the creation (and also remodelling) takes place in the component registry by creating CMD profiles and components, because every profile has an unambiguous expression in XML Schema.} \var{$S_{1..n}$}.
    706 It is very unprobable, that there is a \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
    707 Given the heterogenity of the schemas present in the field of research, full alignments are not achievable at all.
      744The data modeller starts off with an existing schema \var{$S_{x}$}. The system accommodates a set of schemas\footnote{Even though within CMDI the data models are called `profiles', we can still refer to them as `schemas', because every profile has an unambiguous expression in an XML Schema.} \var{$S_{1..n}$}.
      745It is very improbable that there is an \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
     746Given the heterogeneity of the schemas present in the field of research, full alignments are not achievable at all.
    708747However, thanks to the compositional nature of the CMD data model, the data modeller can reuse just parts of any of the schemas -- the
    709748components \var{c}. Thus the task is to find, for every entity $e_{x} \in S_{x}$, the set of semantically equivalent candidate components $\{c_{y}\}$, which corresponds to the definition of the mapping function for single entities as given in \cite{EhrigSure2004}.
    710 Given, that the modeller does not have to reuse the components as they are, but can use existing components as base to create his own, she is helped even with candidates that are not equivalent, thus we can further relax the task and allow even candidates that are just similar to a certain degree, that can be operationalized as threshold $t$ on the output of the \var{similarity} function
      749Given that the modeller does not have to reuse the components as they are, but can use existing components as a base to create her own, even candidates that are not equivalent can be of interest. Thus we can further relax the task and allow candidates that are just similar to a certain degree (operationalized as a threshold $t$ on the output of the \var{similarity} function).
    711750The fact that this is only a pre-processing step meant to provide suggestions to the human modeller implies that recall is more important than precision.
    712751
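
To make the relaxed matching task more tangible, the following sketch retrieves candidate components for a single entity using a crude \var{similarity} function (normalized label similarity combined with the overlap of referenced data categories) and a threshold $t$. The weighting, the threshold value and all labels and identifiers are arbitrary choices for illustration, not part of the actual SMC implementation.

\begin{lstlisting}[language=Python]
# Sketch: recall-oriented retrieval of candidate components for one entity.
from difflib import SequenceMatcher

def similarity(entity, component):
    """Combine label string similarity and data category overlap (ad hoc weights)."""
    label_sim = SequenceMatcher(None, entity["label"].lower(),
                                component["label"].lower()).ratio()
    shared = len(set(entity["datcats"]) & set(component["datcats"]))
    union = len(set(entity["datcats"]) | set(component["datcats"])) or 1
    return 0.5 * label_sim + 0.5 * shared / union

def candidates(entity, components, t=0.4):
    """Return components with similarity >= t, best matches first."""
    scored = [(similarity(entity, c), c["label"]) for c in components]
    return sorted([s for s in scored if s[0] >= t], reverse=True)

entity = {"label": "actorName", "datcats": ["dc:personName"]}
components = [
    {"label": "Actor.Name", "datcats": ["dc:personName"]},
    {"label": "Project", "datcats": ["dc:projectName"]},
]
print(candidates(entity, components))
\end{lstlisting}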
     
    723762
    724763Next to the usual features and measures that can be applied, like label equality, string similarity and structural equality,
    725 the mapping function could be enriched with \emph{extensional} features based on the concept clusters as delivered by the crosswalk service \ref{sec:cx}.
    726 
    727 It would be worthwhile to test, in how far the \var{smcIndex} paths as defined in \ref{def:smcIndex} could be used as feature.
    728 longest matching subpath.
    729 
      764the mapping function could be enriched with \emph{extensional} features based on the concept clusters as delivered by the crosswalk service (\ref{sec:cx}). It would also be worthwhile to test to what extent the \var{smcIndex} paths as defined in \ref{def:smcIndex} could be used as a feature (e.g. by computing the longest matching subpath).
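
As a sketch of how this path feature could be operationalized, the function below computes the longest common contiguous segment sequence of two dotted paths, normalized by the length of the longer path. The dot-separated path notation is an assumption made for the example, not necessarily the exact \var{smcIndex} serialization.

\begin{lstlisting}[language=Python]
# Sketch: longest matching subpath of two dotted paths, normalized to [0,1].
def path_similarity(p1: str, p2: str) -> float:
    a, b = p1.split("."), p2.split(".")
    best = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            best = max(best, k)
    return best / max(len(a), len(b))

print(path_similarity("Session.Actor.Name", "Corpus.Actor.Name"))  # 2/3
\end{lstlisting}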
    730765
    731766Although we exemplified the approach on the case of integrating an external schema, it could also be applied to the schemas already integrated in the system. While there is already a high baseline thanks to the mechanisms of reuse of components and data categories, there certainly still exist semantic proximities that are not explicitly expressed by these mechanisms. This deficiency is rooted in the collaborative creation of the CMD components and profiles, where individual modellers overlooked, deliberately ignored or only partially reused existing components or profiles. This can be seen in the case of the multiple teiHeader profiles, which, though they model the same existing metadata format, are completely disconnected in terms of component and data category reuse (cf. \ref{results:tei}).
    732767
    733 Note, that in the case of reuse of components, in the normal scenario, the semantic equivalency is ensured even though the new component (and all its subcomponents) is a copy of the old one with new identity, because the references to data categories are copied as well, thus by default the new component shares all data categories with the original one and the modeller has to deliberately change them if required. But even with reuse of components scenarios are thinkable, in which the semantic linking gets broken, or is not established, even though semantic equivalency pervails.
      768Note that in the case of component reuse, in the normal scenario, semantic equivalence is ensured even though the new component (and all its subcomponents) is a copy of the old one with a new identity, because the references to data categories are copied as well. Thus, by default, the new component shares all data categories with the original one and the modeller has to deliberately change them if required. But even with reuse of components, scenarios are thinkable in which the semantic linking gets broken, or is never established, even though semantic equivalence prevails.
    734769
    735770The question is what to do with the new correspondences that would possibly be determined when, as proposed, schema matching is applied to the already integrated schemas. One possibility is to add a data category reference, if one element of the pair is still missing one.
    736771However, if both are already linked to a data category, the data category pair could be added to the relation set in the Relation Registry (cf. \ref{def:rr}).
    737772 
    738 Once all the equivalencies (and other relations) between the profiles/schemas were found, simliarity ratios can be determined.
    739 This new simliarity ratios could be applied as alternative weights in the just-profiles graph \ref{sec:smc-cloud}.
    740 
    741 In contrast to the task described here, that -- restricted matching XML schemas -- can be seen as staying in the ``XML World'',
    742 another aspect within this work is clearly situated in the Semantic Web world and requires application of ontology matching methods, the mapping of field values to semantic entities described in \ref{sec:values2entities}.
    743 
      773Once all the equivalences (and other relations) between the profiles/schemas have been found, similarity ratios can be determined.
      774These new similarity ratios could be applied as alternative weights in the profiles-similarity graph (\ref{sec:smc-cloud}).
     775
      776In contrast to the task described here, which -- restricted to matching XML schemas -- can be seen as staying in the ``XML world'',
      777another aspect within this work is clearly situated in the Semantic Web domain and requires the application of ontology matching methods -- the mapping of field values to semantic entities described in \ref{sec:values2entities}.
    744778
    745779%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
     780
    746781
    747782
  • SMC4LRT/chapters/Infrastructure.tex

    r3671 r3776  
    9999
    100100As dictated by the CMD model, all components needed for the modelled resource description are compiled into one profile.
    101 Once a profile is finished, the Component Registry provides automatically the corresponding XML schema in the \code{cmd} target namespace \code{http://www.clarin.eu/cmd}, that can be used as base for creating and validating metadata records.
      101Once a profile is created, the Component Registry automatically provides the corresponding XML schema, which can be used as a basis for creating and validating metadata records in the \code{cmd} namespace \code{http://www.clarin.eu/cmd}.
    102102
    103103\subsubsection*{Ontological Relations -- Relation Registry}
     
    110110
    111111There is a prototypical implementation of such a relation registry called \xne{RELcat}, being developed at the MPI Nijmegen \cite{Windhouwer2011,SchuurmanWindhouwer2011}, which already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST web service\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
    112 This implementation stores the individual relations as RDF triples
    113 
    114 \begin{example3}
    115 subjectDatcat & relationPredicate & objectDatcat
    116 \end{example3}
    117 
    118 allowing typed relations, like equivalency (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently. The relations are deliberately defined in a separate namespace, instead of reusing existing ones (\code{skos:exactMatch, owl:sameAs}) with the aim to avoid introducing too specific semantics. These relations can be mapped to appropriate other predicates when integrating the relation sets in concrete applications.
      112This implementation stores the individual relations as RDF triples, allowing typed relations, like equivalence (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently. The relations are deliberately defined in a separate namespace, instead of reusing existing ones (\code{skos:exactMatch}, \code{owl:sameAs}), with the aim of avoiding the introduction of overly specific semantics. These relations can be mapped to other appropriate predicates when integrating the relation sets into concrete applications.
     113
     114\begin{definition}{The relation triples as stored by the Relation Registry}
     115\textless \ subjectDatcat \ relationPredicate \  objectDatcat \textgreater
     116\end{definition}
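
To give an impression of how a client could consume such a relation set, the following sketch loads the RDF data with the Python library \xne{rdflib} and extracts the pairs related by the equivalence predicate. The namespace URI used for the \code{rel} predicate is an assumption made for this example and would have to be taken from the actual RELcat data.

\begin{lstlisting}[language=Python]
# Sketch: read a relation set and list data category pairs marked as equivalent.
# NOTE: the 'rel' namespace URI below is assumed, not taken from RELcat.
from rdflib import Graph, URIRef

REL_SAME_AS = URIRef("http://www.clarin.eu/relcat/rel#sameAs")  # assumed URI

def sameas_pairs(source):
    """Yield (subjectDatcat, objectDatcat) pairs related by rel:sameAs."""
    g = Graph()
    g.parse(source)  # e.g. a locally downloaded copy of the relation set
    for s, _, o in g.triples((None, REL_SAME_AS, None)):
        yield str(s), str(o)

# usage, assuming the relation set has been fetched to a local RDF/XML file:
# for subj, obj in sameas_pairs("cmdi-relationset.rdf"):
#     print(subj, "<->", obj)
\end{lstlisting}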
    119117
    120118\subsection{Further parts of the infrastructure}
     
    142140
    143141
    144 \subsection{CMDI - Exploitation side}
     142\subsection{CMDI exploitation side}
    145143\label{cmdi_exploitation}
    146144Metadata complying with the CMD data model is being created by a growing number of institutions  by various means -- automatic transformation from legacy data or authoring of new metadata records with the help of one of the metadata editors (cf. \ref{md-editors}). The CMD infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints.  These are being collected daily by a dedicated CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/}. The harvested data is validated against the corresponding schemas (every profile implies a separate schema). In the future a subsequent normalization step will play a bigger role, currently only minimal ad-hoc label normalization is performed for a few organization names. Finally, the data is made (publicly) available as compressed archive files. These are being fetched by the exploitation side applications, that ingest the metadata records, index them and make them available for searching and browsing (cf. figure \ref{fig:cmd-ingestion}).
     
    285283\lstset{language=XML}
    286284\begin{lstlisting}
    287         <dcif:conceptualDomain type="constrained">
    288                 <dcif:dataType>string</dcif:dataType>
    289                 <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
    290                 <dcif:rule>[a-z]{3}</dcif:rule>
    291         </dcif:conceptualDomain>
     285  <dcif:conceptualDomain type="constrained">
     286    <dcif:dataType>string</dcif:dataType>
     287    <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
     288    <dcif:rule>[a-z]{3}</dcif:rule>
     289  </dcif:conceptualDomain>
    292290\end{lstlisting}
    293291
     
    295293
    296294\begin{lstlisting}
    297         <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
     295  <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
    298296\end{lstlisting}
    299297
     
    319317     <dcif:ruleType>CLAVAS vocabulary</dcif:ruleType>
    320318      <dcif:rule>
    321          <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
     319         <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639"
     320                                     type="closed"/>
    322321      </dcif:rule>
    323322  </dcif:conceptualDomain>
     
    359358%%%%%%%%%%%%%%%%%
    360359\section{Other aspects of the infrastructure}
    361 While this work concentrates solely on the metadata, it needs to be recognized, that it is only aspect of the infrastructure and its actual purpose the availability of resources. Metadata is a necessary first step to announce and describe the resources. However it is of little value, if the resources themselves are not accessible. Consequently, another pillar of the CLARIN infrastructure are the centres\furl{http://www.clarin.eu/node/3812}:
      360While this work concentrates solely on the metadata, it is important to acknowledge that metadata is only one aspect of the infrastructure, whose actual purpose is the availability of resources. Announcing and describing the resources by metadata is a necessary first step. However, it is of little value if the resources themselves are not accessible. We want to briefly mention at least two other important aspects: content repositories for storing the resources and federated content search for searching within them.
     361
     362\subsubsection{CLARIN Centres}
     363One view on the CLARIN infrastructure is that of a network of centres\furl{http://www.clarin.eu/node/3812}:
    362364
    363365\begin{quotation}
     
    368370CLARIN also maintains a central registry, the \xne{Centre Registry}\furl{https://centerregistry-clarin.esc.rzg.mpg.de/}, maintaining structured information about every centre, meant as primary entry point into the CLARIN network of centres.
    369371
    370 One core service of such centres are the content repositories, systems meant for long-term preservation and publication of research data and resources. A number of centres have been identified that provide Depositing Services\furl{http://clarin.eu/3773}, i.e. allow third parties researchers (not just the home users) to store research data.
    371 
      372One core service of such centres are the content repositories -- systems meant for long-term preservation and online provision of research data and resources. A number of centres have been identified that provide Depositing Services\furl{http://clarin.eu/3773}, i.e. allow third-party researchers (not just the home users) to store research data.
     373
     374\begin{comment}
    372375In the following a few further well established repositories are mentioned.
    373376
     
    379382\item[OpenAIRE] - Open Acces Infrastructure for Research in Europe \footnote{\url{http://www.openaire.eu/}}
    380383\end{description}
    381 
     384\end{comment}
    382385
    383386\begin{figure*}
     
    389392\end{figure*}
    390393
    391 Another aspect of the availability of resources is, that while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, both due to the size of the data, but mainly due to legal obligations (licenses, copyright), restricting the access to and availability of the resources. CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs}\cite{stehouwer2012fcs} aiming at establishing an architecture allowing to search simultaneously (via the aggregator) across a number of resources hosted by different content providers via a harmonized interface adhering to a common protocol. The agreed upon protocol is a compatible extension of the SRU/CQL protocol developed and endorsed by the Library of Congress as the XML- (and web)based successor of the Z39.50. The maintenance of SRU/CQL has been
    392 transfered from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012\cite{OASIS2012sru}.)
    393 
     394\subsubsection{Federated Content Search}
     395
      396Another aspect of the availability of resources is that, while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, partly due to the size of the data, but mainly due to legal obligations (licenses, copyright) restricting the access to and availability of the resources. CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs} \cite{stehouwer2012fcs}, aiming at establishing an architecture that allows searching simultaneously (via an aggregator) across a number of resources hosted by different content providers via a harmonized interface adhering to a common protocol. The agreed-upon protocol is a compatible extension of the SRU/CQL protocol developed and endorsed by the Library of Congress as the XML- (and web-)based successor of Z39.50 \cite{Lynch1991}.
     397
      398Note that in practice the line between metadata and content data is not so clear -- usually there is a need to filter by metadata even when searching in content. Therefore, most content search engines also feature some kind of metadata filters. Thus it seems reasonable to harmonize the search protocol and query language for metadata and content. This proposition is further elaborated on in \ref{cql}.
    394399
    395400\section{Summary}
  • SMC4LRT/chapters/Introduction.tex

    r3665 r3776  
    109109
    110110\section{Structure of the work}
    111 The work starts with examining the state of the art work in the two fields  language resources and technology and semantic web technologies in chapter \ref{ch:lit}, followed by administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work.
    112 
    113 In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components of the infrastructure underlying this work.
      111The work starts with examining the state of the art in the two fields of language resources and technology and semantic web technologies in chapter \ref{ch:lit}. In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components of the infrastructure underlying this work.
    114112
    115113The main part of the work is found in chapters \ref{ch:design} and \ref{ch:design-instance}, laying out the design of the software module and a proposal for how to model the data in RDF, respectively.
     
    118116The results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.
    119117
      118The auxiliary material accompanying the work is found in the appendix. After the administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work, full specifications of the used data models (\ref{ch:data-model-ref}) and data samples (\ref{ch:cmd-sample}) are listed for reference, as well as the developer and user documentation for the technical solution of this work, the SMC module (\ref{ch:smc-docs}).
     119
     120
    120121\section{Keywords}
    121122
  • SMC4LRT/chapters/Literature.tex

    r3681 r3776  
    1313In recent years, multiple large-scale initiatives have set out to combat the fragmented nature of the language resources landscape in general and the metadata interoperability problems in particular.
    1414
    15 \xne{EAGLES/ISLE Meta Data Initiative} (IMDI) \cite{wittenburg2000eagles} 2000 to 2003 proposed a standard for metadata descriptions of Multi-Media/Multi-Modal Language Resources aiming at easing access to Language Resources and thus increases their reusability.   
      15The \xne{EAGLES/ISLE Meta Data Initiative} (IMDI)\furl{http://www.mpi.nl/imdi/} \cite{wittenburg2000eagles}, running from 2000 to 2003, proposed a standard for metadata descriptions of Multi-Media/Multi-Modal Language Resources, aiming at easing access to Language Resources and thus increasing their reusability.
    1616
    1717\xne{FLaReNet}\furl{http://www.flarenet.eu/} -- Fostering Language Resources Network -- running from 2007 to 2010, concentrated rather on ``community and consensus building'', developing a common vision and mapping the field of LRT via surveys.
    1818
    19 \xne{CLARIN} -- Common Language Resources and Technology Infrastructure -- large research infrastructure providing sustainable access for scholars in the humanities and social sciences to digital language data, and especially its technical core the Component Metadata Infrastructure (CMDI)  -- a comprehensive architecture for harmonized handling of metadata\cite{Broeder2011} --
      19\xne{CLARIN}\furl{http://clarin.eu} -- Common Language Resources and Technology Infrastructure -- a large research infrastructure providing sustainable access for scholars in the humanities and social sciences to digital language data, and especially its technical core, the Component Metadata Infrastructure (CMDI) -- a comprehensive architecture for harmonized handling of metadata \cite{Broeder2011} --
    2020are the primary context of this work; therefore the description of this underlying infrastructure is detailed in a separate chapter (\ref{ch:infra}).
    2121Both above-mentioned projects can be seen as predecessors to CLARIN, the IMDI metadata model being one starting point for the development of CMDI.
     
    3535\label{lit:digi-lib}
    3636
    37 In a broader view we should also regard the activities in the world of libraries.
    38 Starting already in 1970's with connecting, exchanging and harmonizing their bibliographic catalogs, they certainly have a long tradition, wealth of experience and stable solutions.
    39 
    40 Mainly driven by national libraries still bigger aggregations of the bibliographic data are being set up.
    41  The biggest one being the \xne{Worldcat}\furl{http://www.worldcat.org/} (totalling 273.7 million records \cite{OCLCAnnualReport2012})
    42 powered by OCLC, a cooperative of over 72.000 libraries worldwide.
    43 
    44 In Europe, more recent initiatives have pursuit similar goals:
     37In a broader view we should also regard the activities in the domain of libraries and information sciences (LIS).
      38Starting already in the 1970s with connecting, exchanging and harmonizing their bibliographic catalogs, libraries were early adopters and a driving force in the field of search federation even before the era of the internet (e.g. the \xne{Linked Systems Project} \cite{Fenly1988}). The LIS community thus has a long tradition, a wealth of experience and robust solutions with respect to metadata aggregation, harmonization and exploitation.
     39%, starting collaborative efforts in mid 70s
     40
      41Driven mainly by national libraries, ever bigger aggregations of bibliographic data are being set up.
      42The biggest one is \xne{Worldcat}\furl{http://www.worldcat.org/} (totalling 273.7 million records \cite{OCLCAnnualReport2012}), powered by OCLC, a cooperative of over 72,000 libraries worldwide.
     43
      44In Europe, multiple recent initiatives have pursued similar goals of pooling together the immense wealth of information sheltered in the many libraries:
    4545\xne{The European Library}\furl{http://www.theeuropeanlibrary.org/tel4/} offers a search interface over more than 18 million digital items and almost 120 million bibliographic records from 48 National Libraries and leading European Research Libraries.
    4646
    47 \xne{Europeana}\furl{http://www.europeana.eu/} \cite{purday2009think} is a cultural heritage initiative with even broader scope, serving as ``meta-aggregator and portal for European digitised works'', encompassing material not just from libraries, but also museums, archives and all other kinds of collections (In fact, The European Library is the \emph{library aggregator} for Europeana).
    48 
    49 A large number of projects contribute(d) to Europeana. E.g. the auxiliary project \xne{EuropeanaConnect}\furl{http://www.europeanaconnect.eu/} (2009-2011) delivered the core technical components for Europeana as well as further services reusable in other contexts, e.g. the spatio-temporal browser \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo} \cite{janicke2013geotemco}.
    50 Most recently, with \xne{Europeana Cloud}\furl{http://pro.europeana.eu/web/europeana-cloud} (2013 to 2015) a succession of \xne{Europeana} was established, a Best Practice Network, coordinated by The European Library, designed to establish a cloud-based system for Europeana and its aggregators, providing new content, new metadata, a new linked storage system, new tools and services for researchers and a new platform - Europeana Research.
    51 
    52 The related catalogs and formats are described in the section \ref{sec:other-md-catalogs}
     47\xne{Europeana}\furl{http://www.europeana.eu/} \cite{purday2009think} is a cultural heritage initiative with even broader scope, serving as ``meta-aggregator and portal for European digitised works'', encompassing material not just from libraries, but also museums, archives and all other kinds of collections. (In fact, The European Library is the \emph{library aggregator} for Europeana.)
     48
     49A large number of projects contribute(d) to \xne{Europeana}. E.g. the auxiliary project \xne{EuropeanaConnect}\furl{http://www.europeanaconnect.eu/} (2009-2011) delivered the core technical components for Europeana as well as further services reusable in other contexts, one of them being the spatio-temporal browser \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo} \cite{janicke2013geotemco}.
     50Most recently, with \xne{Europeana Cloud}\furl{http://pro.europeana.eu/web/europeana-cloud} (2013 to 2015) another initiative in the realm of \xne{Europeana} has been started, a Best Practice Network, coordinated by The European Library, designed to ``establish a cloud-based system for Europeana and its aggregators, providing new content, new metadata, a new linked storage system, new tools and services for researchers and a new platform - Europeana Research''.
     51
     52The related catalogs and formats are described in the section \ref{sec:lib-formats}.
    5353
    5454
    5555\section{Existing crosswalks (services)}
    5656
    57 Crosswalks as list of equivalent fields from two schemas have been around already for a long time, in the world of enterprise systems, e.g. to bridge to legacy systems and also in libraries,  e.g. \emph{MARC to Dublin Core Crosswalk}\furl{http://loc.gov/marc/marc2dc.html}
    58 
    59 \cite{Day2002crosswalks} lists a number of mappings between metadata formats.
    60 
    61 Mostly Dublin Core and MARC family of formats
    62 
    63 http://www.loc.gov/marc/dccross.html
    64 
    65 
    66 static
    67 metadata crosswalk repository
    68 
    69 
    70 OCLC launched \xne{Metadata Schema Transformation Services}\furl{http://www.oclc.org/research/activities/schematrans.html?urlm=160118}
    71 in particular \xne{Crosswalk Web Service}\furl{http://www.oclc.org/developer/services/metadata-crosswalk-service}
    72 http://www.oclc.org/research/activities/xwalk.html
      57Crosswalks as lists of equivalent fields from two schemas have been around for a long time, both in the world of enterprise systems, e.g. to bridge to legacy systems, and in the LIS domain. \cite{Day2002crosswalks} lists a number of mappings between metadata formats, mostly between the Dublin Core and MARC families of formats.\footnote{\url{http://loc.gov/marc/marc2dc.html}, \url{http://www.loc.gov/marc/dccross.html}}
     58
      59However, these crosswalks are just static correspondence lists, often only available as documents, and cover only a limited set of formats. One effort that comes nearer to our idea of a service delivering crosswalks dynamically is the \xne{Metadata Crosswalk Service}\footnote{\url{http://www.oclc.org/developer/services/metadata-crosswalk-service}, \url{http://www.oclc.org/research/activities/xwalk.html}, (SOAP based)} offered by OCLC as part of its \xne{Metadata Schema Transformation Services}\furl{http://www.oclc.org/research/activities/schematrans.html?urlm=160118}
    7360
    7461\begin{quotation}
     
    7663\end{quotation}
    7764
    78 the Crosswalk Web Service is now a production system that has been incorporated into the following OCLC products and services.
    79 
    80 However the demo service is not available\furl{http://errol.oclc.org/schemaTrans.oclc.org.search}
    81 
    82 
    83 
    84 Offered formats?
    85 These however concentrate on the formats for the LIS community available and are ??
    86 
    87 For this service, a metadata format is defined as a triple of:
    88 
    89     Standard—the metadata standard of the record (e.g. MARC, DC, MODS, etc ...)
    90     Structure—the structure of how the metadata is expressed in the record (e.g. XML, RDF, ISO 2709, etc ...)
    91     Encoding—the character encoding of the metadata (e.g. MARC8, UTF-8, Windows 1251, etc ...)
    92 
    93 
    94 Offered interface!?
    95 he Crosswalk Web Service has 4 methods:
    96 
    97     translate(...) - This method translates the records. See the documentation for more information.
    98     getSupportedSourceRecordFormats() - This method returns a list of formats that are supported as input formats.
    99     getSupportedTargetRecordFormats() - This method returns a list of formats that the input formats can be translated to.
    100     getSupportedJavaEncodings() - Some formats will support all of the character encodings that Java supports. This function returns the list of encodings that Java supports.
    101 
      65Although the website states that the ``Crosswalk Web Service is now a production system that has been incorporated into OCLC products and services'', the demo service\furl{http://errol.oclc.org/schemaTrans.oclc.org.search} is not accessible. Also, this service only offers crosswalks between formats relevant for the LIS community: \xne{Dublin Core, MARCXML, MARC-2709, MODS}. So, altogether, the service does not seem suitable to be used as is for the purposes of this work, but it can certainly serve as inspiration for the specification of the planned service.
     66
     67\begin{comment}
     68The Crosswalk Web Service has 4 methods:
     69\begin{description}
     70\item[translate()]  This method translates the records.
     71\item[getSupportedSourceRecordFormats()]  This method returns a list of formats that are supported as input formats.
     72\item[getSupportedTargetRecordFormats()] This method returns a list of formats that the input formats can be translated to.
     73\item[getSupportedJavaEncodings()] Some formats will support all of the character encodings that Java supports. This function returns the list of encodings that Java supports.
     74\end{description}
     75\end{comment}
    10276
    10377
     
    154128This elegant abstraction introduced with the \var{similarity} function provides a general model that can accommodate a broad range of comparison relationships and corresponding similarity measures. And here, again, we encounter a broad range of possible approaches.
    155129
    156 \cite{ehrig2004qom} lists a number of basic features and corresponsing similarity measures:
    157 Starting from primitive data types, next to value equality, string similarity, edit distance or in general relative distance can be computed.
    158 For concepts, next to the directly applicable unambiguous \code{sameAs} statements, label similarity can be determined (again either as string similarity, but also broaded by employing external taxonomies and other semantic resources like WordNet - \emph{extensional} methods), equal (shared) class instances, shared superclasses, subclasses, properties.
    159 
    160 Element-level (terminological)  vs structure-level (structural)  \cite{Shvaiko2005_classification}
    161 
    162 based on background knowledge...
    163 
    164 subclass–superclass relationships, domains and ranges of properties, analysis of the graph structure of the ontology.
    165 
    166 For properties the degree of the super an subproperties equality, overlapping domain and/or range.
    167 Additionally to these measures applicable on individual ontology items, there are approaches (like the \var{Similarity Flooding algorithm} \cite{melnik2002similarity}) to propagate computed similarities across the graph defined by relations between entities (primarily subsumption hierarchy).
      130\cite{ehrig2004qom} lists a number of basic features and corresponding similarity measures; \cite{Shvaiko2005_classification} classifies the features into element-level (terminological), structure-level (structural) and those based on background knowledge (extensional):
      131Starting from primitive data types, next to value equality, string similarity, edit distance or, in general, relative distance can be computed. For concepts, besides the directly applicable unambiguous \code{sameAs} statements, label similarity can be determined (again either as string similarity or by employing external taxonomies and other semantic resources like WordNet -- \emph{extensional} methods), as well as equal (shared) class instances, subclass–superclass relationships and shared properties. For properties, the degree of equality of super- and subproperties and overlapping domains and/or ranges can be considered.
     132
      133In addition to these measures applicable to individual ontology items, there are approaches (like the \var{Similarity Flooding algorithm} \cite{melnik2002similarity}) to propagate computed similarities across the graph defined by relations between entities (primarily the subsumption hierarchy), or even to analyse and compare the overall graph structure of the ontology.
    168134
    169135\cite{Algergawy2010} classifies, reviews, and experimentally compares major methods of element similarity measures and their combinations. \cite{shvaiko2012ontology} comparing a number of recent systems finds that ``semantic and extensional methods are still rarely employed. In fact, most of the approaches are quite often based only on terminological and structural methods.
     
    189155A number of existing systems for schema/ontology matching/alignment is collected in the above-mentioned overview publications:
    190156
    191 IF-Map \cite{kalfoglou2003if}, QOM \cite{ehrig2004qom}, \xne{FOAM} \cite{EhrigSure2005}, Similarity Flooding (SF) \cite{melnik}, S-Match \cite{Giunchiglia2007_semanticmatching}, the Prompt tools \cite{Noy2003_theprompt} integrating with Protégé or \xne{COMA++} \cite{Aumueller2005}, \xne{Chimaera}. Additionally, \cite{shvaiko2012ontology} lists and evaluates some more recent contributions: \xne{SAMBO, Falcon, RiMOM, ASMOV, Anchor-Flood, AgreementMaker}.
     157\xne{IF-Map} \cite{kalfoglou2003if}, \xne{QOM} \cite{ehrig2004qom}, \xne{FOAM} \cite{EhrigSure2005}, \xne{Similarity Flooding (SF)} \cite{melnik}, \xne{S-Match} \cite{Giunchiglia2007_semanticmatching}, the \xne{Prompt} tools \cite{Noy2003_theprompt} integrating with \xne{Protégé} or \xne{COMA++} \cite{Aumueller2005}, \xne{Chimaera}. Additionally, \cite{shvaiko2012ontology} lists and evaluates some more recent contributions: \xne{SAMBO, Falcon, RiMOM, ASMOV, Anchor-Flood, AgreementMaker}.
    192158
    193159All of the tools use multiple methods as described in the previous section, exploiting both element-level as well as structural features and applying some kind of composition or aggregation of the computed atomic measures to arrive at an alignment assertion.
     
    206172
    207173\subsubsection{Semantic Web - Technical solutions / Server applications}
    208 
    209 
    210 The provision of the produced semantic resources on the web requires technical solutions to store the RDF triples, query them efficiently
    211 and idealiter expose them via a web interface to the users.
    212 
    213 Meanwhile a number of RDF triple store solutions relying both on native, DBMS-backed or hybrid persistence layer are available, open-source solutions like \xne{Jena, Sesame} or \xne{BigData} as well as a number of commercial solutions \xne{AllegroGraph, OWLIM, Virtuoso}.
     174\label{semweb-tech}
     175
     176The provision of the produced semantic resources on the web requires technical solutions to store the RDF triples, query them efficiently via SPARQL\cite{SPARQL2008} and \textit{idealiter} expose them via a web interface to the users.
     177
      178Meanwhile, a number of RDF triple store solutions relying on native, DBMS-backed or hybrid persistence layers are available: open-source solutions like \xne{Jena, Sesame} or \xne{BigData} as well as commercial solutions like \xne{AllegroGraph, OWLIM, Virtuoso}.
    214179
    215180A qualitative and quantitative study \cite{Haslhofer2011europeana} in the context of Europeana evaluated a number of RDF stores (using the whole Europeana EDM data set of 382,629,063 triples as data load) and came to the conclusion that ``certain RDF stores, notably OpenLink Virtuoso and 4Store'' can handle the large test dataset.
    216181
    217182\xne{OpenLink Virtuoso Universal Server}\furl{http://virtuoso.openlinksw.com} is a hybrid storage solution for a range of data models, including relational data, RDF and XML, and free text documents \cite{Erling2009Virtuoso, Haslhofer2011europeana}.
    218 Virtuoso is used to host many important Linked Data sets (e.g., DBpedia\furl{http://dbpedia.org} \cite{auer2007dbpedia}).
     183Virtuoso is used to host many important Linked Data sets, e.g., DBpedia\furl{http://dbpedia.org} \cite{auer2007dbpedia}.
    219184Virtuoso is offered both under a commercial and an open-source license.
    220185
    221186Another solution worth examining is the \xne{Linked Media Framework}\furl{http://code.google.com/p/lmf/} -- ``easy-to-setup server application that bundles together three Apache open source projects to offer some advanced services for linked media management'': publishing legacy data as linked data, semantic search by enriching data with content from the Linked Data Cloud, using SKOS thesaurus for information extraction.
    222187
    223 One more specific work is that of Noah et. al \cite{Noah2010} developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching.
    224 
      188One more specific work is that of Noah et al. \cite{Noah2010}, developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching. Another solution in a related, more specialized domain and already in productive use is \xne{rechercheisidore}\furl{http://rechercheisidore.fr} \cite{pouyllau2011isidore}, a French portal for digital humanities resources.
    225189
    226190\begin{comment}
     
    231195
    232196Haystack\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)}
     197
     198\todoin{check SARQ}\furl{http.//github.com/castagna/SARQ}
     199
    233200\end{comment}
    234201
    235202\subsubsection{Ontology Visualization}
    236203
    237 Landscape, Treemap, SOM
    238 
    239 \todoin{check Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf}
      204Complex structured datasets like ontologies require dedicated means for their high-level exploration, like aggregations and interactive visualization techniques. A large variety of solutions has been implemented in the last two decades (cf. the overview of the field in \cite{lanzenberger2010ontology}, also for citations of the tools listed below). Given the inherent graph structure of the RDF data model, the obvious and most common approach is a tree- or graph-based visualization with concepts being represented as nodes and relations as edges. Numerous solutions are realized as plug-ins for the widespread open-source ontology editor \xne{Prot\'{e}g\'{e}} \cite{grosso1999protege} developed at Stanford University, like \xne{OntoViz, Jambalaya, TouchGraph, OWLViz, OntoSphere, PromptViz} etc.
     205
     206There exists also a sizable number of stand-alone solutions (\xne{Ontorama, FOAFnaut, IsaViz, GKB-Editor} and more) though often bound to a specific dataset or data type (\xne{Wordnet, FOAF, Cyc}).
     207
      208There are also plenty of general graph visualization tools that can be adapted for viewing RDF data as a graph, like the traditional graph layouting tool \xne{GraphViz dot}, or, more recently, \xne{Gephi} \cite{bastian2009gephi}, a stand-alone interactive tool for graph visualization with a number of layouting algorithms and display options. The rather recent generic visualization JavaScript library \xne{d3}\footnote{\url{http://d3js.org}} \cite{bostock2011d3} seems especially appealing thanks to its data-driven paradigm, dedicated support for graphs with an integrated customizable graph layouting algorithm and -- being pure JavaScript -- allowing web-based solutions.
     209
     210%Most recently a web-based version of this versatile tool has been released\furl{http://protegewiki.stanford.edu/wiki/WebProtege} that supports collaborative ontology development
     211
      212The solutions are rather sparse when it comes to more advanced visualizations beyond the simple one-to-one display of the data model graph as a visual graph, especially the visualization of ontology mapping and alignment. Besides \xne{OLA} \cite{euzenat2004ola}, \xne{PromptViz} \cite{Noy2003_theprompt} and \xne{CogZ} \cite{falconer2009cogz}, we would like to point out one solution developed at the IFS of the Technical University in Vienna \cite{lanzenberger2006alviz}, \xne{AlViz} -- a tool which visually supports semi-automatic alignment of ontologies. It is implemented as a ``multiple-view plug-in for Protege using J-Trees and Graphs. Based on similarity measures of an ontology matching algorithm AlViz helps to assess and optimize the alignment results.'' It applies visual cues like colouring to indicate the computed similarity of concepts between two ontologies and clustering to reduce the complexity of the displayed datasets (cf. figure \ref{fig:alviz}). Unfortunately, the development of this very promising research prototype seems to have stalled, the last available version being from 2009.
     213
     214\begin{figure*}
     215\begin{center}
     216\includegraphics[width=0.8\textwidth]{images/AlViz_screenshot.png}
     217\caption{Screenshot of AlViz -- tool for visual exploration of ontology alignment \cite{lanzenberger2006alviz}}
     218\label{fig:alviz}
     219\end{center}
     220\end{figure*}
    240221
    241222
     
    243224\section{Language and Ontologies}
    244225
    245 There are two different relation links betwee language or linguistics and ontologies: a) `linguistic ontologies' domain ontologies conceptualizing the linguistic domain, capturing aspects of linguistic resources; b) `lexicalized' ontologies, where ontology entities are enriched with linguistic, lexical information.
      226There are two different kinds of links between language or linguistics and ontologies: a) `linguistic ontologies' -- domain ontologies conceptualizing the linguistic domain, capturing aspects of linguistic resources; b) `lexicalized' ontologies, where ontology entities are enriched with linguistic, lexical information.
    246227
    247228\subsubsection{Linguistic ontologies}
     
    270251Another indication of the heritage is the fact that concepts of the GOLD ontology were migrated into ISOcat (495 items) in 2010.
    271252
    272 Notice that although this work is concerned with language resources, it is primarily on the metadata level, thus the overlap with linguistic ontologies codifying the terminology of the discipline linguistic is rather marginal (perhaps on level of description of specific linguistic aspects of given resources).
      253Notice that although this work is concerned with language resources, it operates primarily on the metadata level, thus the overlap with linguistic ontologies codifying the discipline-specific linguistic terminology is rather marginal (perhaps at the level of the description of specific linguistic aspects of given resources).
    273254
     274255\subsubsection{Lexicalised ontologies, ``ontologized'' lexicons}
    275 
    276256
     277257The other type of relation between ontologies and linguistics or language is lexicalised ontologies. Hirst \cite{Hirst2009} elaborates on the differences between ontology and lexicon and on the possibility of reusing lexicons for the development of ontologies.
  • SMC4LRT/chapters/Results.tex

    r3681 r3776  
    3131\\
    3232
    33 \url{http://clarin.aac.ac.at/smc} (soon: \url{http://acdh.ac.at/smc})
     33\url{http://clarin.arz.oeaw.ac.at/smc} (soon: \url{http://acdh.ac.at/smc})
    3434
    3535
     
    4141This interface is available as part of the smc application:
    4242
    43 \url{http://clarin.aac.ac.at/smc/cx}
     43\url{http://clarin.arz.oeaw.ac.at/smc/cx}
    4444
     4545\subsection{SMC -- as a module within the Metadata Repository}
     4646The SMC is also integrated as a module within the Metadata Repository, enabling \emph{semantic search} over the joint metadata domain.
    4747
    48 \url{http://clarin.aac.ac.at/mdrepo/smc}
      48\url{http://clarin.arz.oeaw.ac.at/mdrepo/} (module not integrated yet)
    4949
    5050\subsection{SMC Browser -- advanced interactive user interface}
     
     5252SMC Browser is an advanced web-based visualization application for exploring the complex dataset of the \xne{Component Metadata Infrastructure} by rendering its structure as an interactive graph. In particular, it enables the metadata modeller to examine the reuse of components or DCs in different profiles. The graph is accompanied by numerical statistics about the dataset as a whole and about individual items (profiles, components, data categories), a set of example results and user documentation. Details about design and implementation can be found in \ref{smc-browser}. The publicly available instance is maintained at:
    5353
    54 \url{http://clarin.aac.ac.at/smc/browser}
     54\url{http://clarin.arz.oeaw.ac.at/smc-browser}
    5555
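As a side note, the kind of per-type statistics mentioned above can in principle be computed directly from a d3-style graph serialization. The following sketch is illustrative only and assumes a hypothetical \code{graph.json} with a \code{nodes} array whose entries carry a \code{type} attribute; it does not reflect the actual data format or code of the SMC Browser.

\begin{lstlisting}[caption=Counting nodes per type in a d3-style graph (illustrative only)]
// Hypothetical file name and attribute names -- assumptions for illustration.
d3.json("graph.json", function (error, graph) {
  if (error) { console.error(error); return; }

  // Tally the nodes by their type attribute (e.g. profile, component, datcat).
  var counts = {};
  graph.nodes.forEach(function (n) {
    counts[n.type] = (counts[n.type] || 0) + 1;
  });

  console.log(counts); // e.g. {profile: ..., component: ..., datcat: ...}
});
\end{lstlisting}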
    5656\begin{figure*}
     
    287287\begin{figure*}[!ht]
    288288\begin{center}
    289 \includegraphics[width=1\textwidth]{images/just_profiles_6.png}
     289\includegraphics[width=1\textwidth]{images/just_profiles_9.png}
    290290\end{center}
    291291\caption{SMC cloud -- graph visualizing the semantic proximity of profiles}
  • SMC4LRT/chapters/acknowledgements.tex

    r2697 r3776  
    11\chapter*{Acknowledgements}
    22
    3 I would like to thank all the colleagues from the CLARIN community, for the support, the fruitful discussions and helpful feedback, especially Daan Broeder, Menzo Windhouwer, Marc Kemps-Snijders, Hennie Brugman.
    4 
      3I would like to thank all the colleagues from my institute and from the CLARIN community for their support, the fruitful discussions and the helpful feedback, especially Menzo Windhouwer, Daan Broeder, Dieter Van Uytvanck, Marc Kemps-Snijders and Hennie Brugman. \\
      4And to all my dear ones, for the extra portion of patience I demanded from them.
     5\\
     6 \\
    57With love to em.
  • SMC4LRT/chapters/appendix.tex

    r3665 r3776  
    55
    66\chapter{Data model reference}
     7\label{ch:data-model-ref}
     78In the following, the complete data models and schemas are listed for reference: the diagram of the data model for data category specification in figure~\ref{fig:DCR_data_model}, \xne{Terms.xsd} -- the XML schema used by the SMC module internally -- in listing~\ref{lst:terms-schema} (cf. \ref{datamodel-terms}) and the \xne{general-component-schema.xsd}\furl{https://infra.clarin.eu/cmd/general-component-schema.xsd} -- the schema representing the CMD meta model for defining CMD profiles and components -- in listing~\ref{lst:cmd-schema}. Figure \ref{fig:ref_arch} depicts an abstract reference architecture that provides a conceptual frame for this work, and figure \ref{fig:acdh_context} gives an overview of the roles and services of the \xne{ACDH -- Austrian Centre for Digital Humanities} -- the home of SMC -- explicating the concrete current situation regarding the architectural context of SMC.
    89
     
    3738
    3839\chapter{CMD -- sample data}
     40\label{ch:cmd-sample}
    3941
    4042\section{Definition of a CMD profile}
      43The following listing presents a sample CMD specification for the \concept{collection\#clarin.eu:cr1:p\_1345561703620} profile.
     44
     45\input{chapters/collection_spec.xml.tex}
    4146
    4247\section{CMD record}
      48The following listing shows a sample CMD record -- an instance of the \concept{collection} profile specified above.
     49
     50\input{chapters/collection_instance.xml.tex}
    4351
    4452
    45 \chapter{SMC Browser -- related material }
     53\chapter{SMC -- documentation}
     54\label{ch:smc-docs}
    4655
     56\begin{figure*}
     57\begin{center}
     58\includegraphics[height=1\textwidth, angle=90]{images/build_init.png}
     59\end{center}
     60\caption{A graphical representation of the dependencies and calls in the main \xne{ant} build file.}
     61\label{fig:smc-build_init}
     62\end{figure*}
    4763
    48 \begin{figure*}[!ht]
    49 \begin{center}
    50 \includegraphics[width=1\textwidth]{images/cmd-deps-graph_part2.png}
    51 \end{center}
    52 \caption{An early version of a visual representation of (a part of) the \xne{smc-graph} generated with the \code{dot} tool.}
    53 \label{fig:cmd-dep-dotgraph}
    54 \end{figure*}
     64\section{Documentation of smc-xsl}
     65\label{sec:smc-xsl-docs}
     66\todoin{generate and reference XSLT-documentation}
    5567
    5668\section{SMC Browser user documentation}
     
    6274\label{sec:smc-graphs}
    6375
     76\begin{figure*}[h]
     77\begin{center}
     78\includegraphics[width=1\textwidth]{images/cmd-deps-graph_part2.png}
     79\end{center}
     80\caption{An early version of a visual representation of (a part of) the \xne{smc-graph} generated with the \code{dot} tool.}
     81\label{fig:cmd-dep-dotgraph}
     82\end{figure*}
     83
     84
    6485\begin{comment}
    6586       
    6687\chapter{SMC Reports}
    67 \label{ch:smc-reports}
     88%\%label{ch:smc-reports}
    6889
    6990SMC Reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain based on the visual and numerical output from the SMC Browser \ref{smc-browser}.
  • SMC4LRT/chapters/userdocs_cleaned.tex

    r3666 r3776  
    9999The nodes are colour-coded by type:
    100100
    101 \includegraphics[height=100px]{C:/Users/m/3/clarin/_repo/SMC/docs/graph_legend.svg}
     101\includegraphics[height=100px]{images/graph_legend.png}
    102102
    103103\phantomsection\label{select-nodes}
  • SMC4LRT/images/Terms.xsd.tex

    r3640 r3776  
    22\begin{lstlisting}[label=lst:terms-schema, caption=Terms.xsd -- schema of the internal data model \ref{datamodel-terms}]
    33<?xml version="1.0" encoding="UTF-8"?>
    4 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" xmlns:ns2="http://www.w3.org/1999/xlink">
     4<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
     5elementFormDefault="qualified" xmlns:ns2="http://www.w3.org/1999/xlink">
    56  <xs:import namespace="http://www.w3.org/1999/xlink" schemaLocation="ns2.xsd"/>
    67  <xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="xml.xsd"/>
  • SMC4LRT/thesis.tex

    r3666 r3776  
    2828\thesisverfassung{Matej \v{D}ur\v{c}o} % Verfasser
    2929\thesisauthor{Matej \v{D}ur\v{c}o} % your name
    30 \thesisauthoraddress{Viktorgasse 8/6, 1040 Wien} % your address
      30\thesisauthoraddress{Josefstädterstrasse 70/32, 1080 Wien} % your address
    3131\thesismatrikelno{0005416} % your registration number
    3232
    33 \thesisbetreins{ao.Univ.-Prof.?? Dr. Andreas Rauber}
    34 \thesisbetrzwei{Univ.-Prof. Mag. Dr. Gerhard Budin}
     33\thesisbetreins{ao.Univ.-Prof. Dr. Andreas Rauber, Univ.-Prof. Mag. Dr. Gerhard Budin}
     34\thesisbetrzwei{}
    3535%\thesisbetrdrei{Dr. Vorname Familienname} % optional
    3636
     
    5858
    5959
    60 %\begin{comment}
     60\begin{comment}
     61\end{comment}
    6162\input{chapters/Introduction}
    6263
    6364\input{chapters/Literature}
    6465
    65 \input{chapters/Definitions}
    66 
    67 
    6866\input{chapters/Data}
    6967
    7068\input{chapters/Infrastructure}
     69
    7170\input{chapters/Design_SMCschema}
    72 
    73 
    7471
    7572\input{chapters/Design_SMCinstance}
    7673
    7774\input{chapters/Results}
    78 %\end{comment}
     75
    7976\input{chapters/Conclusion}
    8077
     
    9087%\bibliography{references}
    9188%\bibliographystyle{ieeetr}
    92 \bibliography{../../2bib/lingua,../../2bib/ontolingua,../../2bib/smc4lrt,../../2bib/semweb,../../2bib/distributed_systems,../../2bib/own}
     89\bibliography{../../2bib/lingua,../../2bib/ontolingua,../../2bib/smc4lrt,../../2bib/semweb,../../2bib/distributed_systems,../../2bib/own,../../2bib/diglib,../../2bib/it-misc,../../2bib/infovis}
    9390
    9491\appendix
     92
     93\input{chapters/Definitions}
    9594
    9695\input{chapters/appendix}
Note: See TracChangeset for help on using the changeset viewer.