Timestamp:
03/15/13 21:44:23 (11 years ago)
Author:
vronk
Message:

adding figures, big reorganization of the content structure
some new text on Controlled Vocabularies, Underlying Infrastructure.

File:
1 edited

  • SMC4LRT/chapters/Infrastructure.tex

    r2696 r2703  
    1818
    1919\begin{itemize}
    20 \item Data Category Registry,
     20\item Data Category Registry
    2121\item Relation Registry
     22\item Schema Registry
    2223\item Component Registry
    2324\item Vocabulary Alignment Service (OpenSKOS)
     
    5960!Cf. Erhard Hinrichs 2009
    6061
     62\todoin{Describe SCHEMAcat}
     63
    6164\noindent
    6265All these components are running services that this work shall directly build upon.
     
    8992
    9093\subsubsection{Vocabulary Service - CLAVAS}
    91 As described in previous section (\ref{dcr}), a solid pilar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is – by design – not usable for all kinds of reference data. In general, it suits well for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow does not lend itself well to maintain “semi-closed” concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added).
     94As described in the previous section (\ref{dcr}), a solid pillar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is, by design, not usable for all kinds of reference data. In general, it is well suited for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow do not lend themselves well to maintaining ``semi-closed'' concept domains, i.e. controlled vocabularies such as lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not every string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed to change (especially through new entities being added).
    9295
    9396This shortcoming leads to a need for an additional registry/repository service for this kind of data (controlled vocabularies). Within the CLARIN project mainly the abovementioned taskforce \emph{CLAVAS} is concerned with this challenge.
     
    104107The following are to be handled in the short term, in order of urgency/relevance/priority:
    105108\begin{itemize}
    106 \item the list of language codes\todo{url: ISO-639}
     109\item the list of language codes\todoin{url: ISO-639}
    107110\item country codes
    108111\item organization names for the domain of language resources
     
    110113
    111114See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies
    112 and \ref{dcr-skos} for discussion on mapping the information about data categories from ISOcat to \xne{SKOS}.
     115and \ref{interaction-dcr-skos} for discussion on mapping the information about data categories from ISOcat to \xne{SKOS}.
    113116
    114117\subsection{Interaction between DCR, VAS and client applications}
    115 
    116 
    117 In my view you do that in ISOcat by binding the constrained DC to the
    118 CLAVAS vocabulary, e.g., the constrained domain of /language ID/ (DC-2482)
    119 could look as follows:
    120 
    121 I think is no need to express the relationship between this constrained DC
    122 and the vocabulary in CLAVAS itself. Many DCs (or any other application
    123 using CLAVAS) can refer to the same CLAVAS vocabulary.
    124 
    125 
    126 See above for my reasoning. I don't think this information needs to be in
    127 CLAVAS.
    128 I do think that ISOcat, CLAVAS, RELcat, an actual language
    129 resource all provide a part of the semantic network.
    130 
    131 And if you can express these all in RDF, which we can for almost all of them (maybe
    132 except the actual language resource ... unless it has a schema adorned
    133 with ISOcat DC references ... < insert a SCHEMAcat plug ;-) >, but for
    134 metadata we have that in the CMDI profiles ...) you could load all the
    135 relevant parts in a triple store and do your SPARQL/reasoning on it. Well
    136 that's where I'm ultimately heading with all these registries related to
    137 semantic interoperability ... I hope ;-)
    138 
    139 
    140 Maybe I should add to this that I clearly see ISOcat as an user of CLAVAS,
    141 i.e., for constrained DCs.
    142 
    143 However, ISOcat as a provider of vocabularies
    144 is less clear to me. Many of the value domains are small and CLAVAS is
    145 overkill.
    146 
     118\label{interaction-dcr-skos}
     119
     120
     121The DCR recognizes the following types of data categories (Figure \ref{fig:dc_type}):
     122simple, complex: closed, open, constrained, (container)?
     123
     124\begin{figure*}[!ht]
     125\begin{center}
     126\includegraphics[width=0.7\textwidth]{images/dc_types}
     127\end{center}
     128\caption{Data Category types}
     129\label{fig:dc_type}
     130\end{figure*}
     131\todocite{DC types - ISOcat introduction at CLARIN-NL Workshop}
     132
     133See Figure \ref{fig:DCR_data_model} for the full DCR data model.
     134
     135\subsubsection{Export DCR to SKOS}
     136\todocite{Menzo2013-03-12 mail}
     137
     138
     139The semantic proximity of a /data category/ to a /concept/ may suggest
     140a na\"{\i}ve approach to mapping DCR to SKOS, namely mapping every data category (from one profile) to a concept,
     141all of them belonging to the \xne{ISOcat-profile:ConceptScheme}.
     142However, this is not practical or useful: ISOcat as a whole is too disparate, and so would be the resulting vocabulary.
     143
     144A more sensible approach is to export only closed DCs as separate ConceptSchemes and their respective simple DCs as Concepts within that scheme.
     145The rationale is that if we see a vocabulary as a set of possible values for a
     146field/element/attribute, complex DCs in ISOcat are the users of such
     147vocabularies and simple DCs the DCR equivalent of values in such a
     148vocabulary.\todocite{Menzo}
     149
     150Another aspect is that a simple DC can be in the value domains of multiple closed DCs.
     151Likewise, a skos:Concept can belong to multiple ConceptSchemes\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
     152So there could be a 1:1 mapping of [complex closed DCs] to [skos:ConceptSchemes] and of [simple DCs] to [skos:Concepts].
     153That would automatically convey the possibly multiple membership of simple DCs / skos:Concepts in closed DCs / skos:ConceptSchemes.
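
\noindent
A minimal sketch of such a 1:1 mapping in RDF/XML might look as follows. The \code{example.org} URIs and the label are placeholders chosen purely for illustration and do not correspond to actual ISOcat identifiers; only the SKOS terms themselves are taken from the SKOS vocabulary. One simple DC is represented as a \code{skos:Concept} that is a member of two \code{skos:ConceptScheme}s, each standing for one closed DC:

\lstset{language=XML}
\begin{lstlisting}
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <!-- two closed DCs, each exported as its own concept scheme
       (placeholder URIs) -->
  <skos:ConceptScheme rdf:about="http://example.org/dc/closed-A"/>
  <skos:ConceptScheme rdf:about="http://example.org/dc/closed-B"/>

  <!-- one simple DC that is in the value domain of both closed DCs -->
  <skos:Concept rdf:about="http://example.org/dc/simple-1">
    <skos:prefLabel xml:lang="en">example value</skos:prefLabel>
    <skos:inScheme rdf:resource="http://example.org/dc/closed-A"/>
    <skos:inScheme rdf:resource="http://example.org/dc/closed-B"/>
  </skos:Concept>
</rdf:RDF>
\end{lstlisting}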
     154
     155Alternatively, for each value domain a SKOS concept scheme with SKOS concepts can be created,
     156i.e., a SKOS concept always belongs to one concept scheme, but multiple SKOS concepts may refer to the same simple DC using <dcr:datcat/> (and <dcterms:source/>).
     157This is how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest
     158/representations/dcs2/clavas.xsl}
     159
     160
     161
     162\begin{figure*}[!ht]
     163\begin{center}
     164\includegraphics[width=0.6\textwidth]{images/export_DCR2SKOS.png}
     165\end{center}
     166\caption{The data flow and linking between schema, data categories and vocabularies}
     167\label{fig:export_dcr2skos}
     168\end{figure*}
     169 
     170Open or constrained DCs are not exported, as they do not contribute anything to a vocabulary.
     171There is no need to express the relationship between a constrained DC
     172and the vocabulary in CLAVAS itself.
     173Indeed, it is not possible to express the conceptualDomain/range of a data category within SKOS.
     174
     175However, they can refer to a CLAVAS vocabulary. Indeed, providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository.
     176
     177However, it has yet to be assessed how useful this approach is. In the metadata profile
     178there are many closed DCs with small value domains. How useful are those
     179in CLAVAS?
     180
     181Originally, the vocabulary repository was conceived to manage rather large and complex value domains that do not fit easily into the DCR data model.
    147182Where the value domains are big (ISO 639-3) or can only be
    148183partially enumerated (organization names) ISOcat can't/shouldn't contain
     
    152187providers, e.g., /linguistic subject/ (DC-2527/), and still also need to
    153188stay in ISOcat. I think at some point we should create a smaller set of
    154 metadata DCs to be harvested by CLAVAS. Hennie and I discussed this also
    155 somewhere last year ... I'll be a the Meertens on Thursday, maybe we can
    156 talk it over once more.
    157 
    158 
    159 >>
    160 
    161 I guess the discussion is about two different things:
    162 - how to specify that the range of some metadata property consists of Concepts from a specific ConceptScheme
    163 -> this can not be done in SKOS, but external schema definitions could refer to the URI of some (CLAVAS/OpenSKOS) ConceptScheme
    164 - how to specifiy relations between Concepts that are in different ConceptSchemes
    165 -> this can be done in SKOS using skos: exactMatch, closeMatch, broaderMatch, narrowerMatch, relatedMatch. OpenSKOS supports adding and searching these properties already, and the OpenSKOS editor also already has support for it.
    166 
    167 > - define them in a new clavas namespace and add the properties as a specialization to OpenSKOS, you consider them part of the vocabulary definition then
    168 > --> is a bit against the OpenSKOS 'philosophy' that OpenSKOS is a platform for SKOS, by definition.
    169 > - add them to your metadata schema or profile, your consider them as constraints on vocabulary usage for a given metadata field
    170 > --> this would be my preference
    171 > - add them to a definition in ISOcat, and let your metadata schema refer to ISOcat instead of OpenSKOS. ISOcat extends the OpenSKOS definition then.
    172 > --> leads to mixing of ISOcat and OpenSKOS, in semantic and technical ways. Not my preference.
    173 
    174 In what I propose ISOcat constrained DCs can refer to a CLAVAS vocabulary as a way to constrain (we stretch this a bit if a vocabulary is 'open', e.g., like organization names where it provides the preferred spelling of known organizations but you still have to be able to add new organization names). In ISOcat such constraints have the same status as, for example, the data type, which is that ISOcat just provides hints it has no way to enforce this. Look at CMDI where the CMDI elements refer to a ISOcat DC via a concept link but they may have a completely different data type. In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (well its in the planning).
    175 
    176 So although ISOcat refers to CLAVAS as a hint, the metadata schema is the final one that has the real CLAVAS vocabulary reference, i.e., no reference to CLAVAS via ISOcat. Hennie, I think that still meets your preference and prevents unwanted mixing.
    177 
    178 
    179 
     189metadata DCs to be harvested by CLAVAS.
     190Therefore a threshold seems sensible, where only value domains with more
     191than 20, 50 or 100 values are exported.
     192
     193
     194\subsubsection{Vocabulary linking and use}
     195Currently (before integration of VAS and DCR), the only way to constrain the value domain of a data category
     196is by the means that XML Schema provides\todoin{check xml schema possibilities to restrict values}, such as a regular expression. So for the data category \concept{languageID DC-2482}
     197the rule looks like:
     198\lstset{language=XML}
     199\begin{lstlisting}
     200        <dcif:conceptualDomain type="constrained">
     201                <dcif:dataType>string</dcif:dataType>
     202                <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
     203                <dcif:rule>[a-z]{3}</dcif:rule>
     204        </dcif:conceptualDomain>
     205\end{lstlisting}
     206
     207A current proposal by Windhouwer\todocite{Menzo2013-03-12 mail} for integration with CLAVAS foresees the following extension:
     208
     209\begin{lstlisting}
     210        <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
     211\end{lstlisting}
     212
     213\code{@href} points to the vocabulary. Actually, a PID should be used in the context
     214of ISOcat, but it is not clear how persistent the vocabularies are. This may pose a problem, as part of the DC specification may then have a different persistency than the core.
     215
     216\code{@type} could be \code{closed} or \code{open}. \code{closed}: only values in the vocabulary are
     217valid. \code{open}: the values in the vocabulary are hints/preferred values. Basically the DC itself is then open.
     218
     219This would yield a definition of the conceptualDomain for the data category as follows:
     220 
     221\lstset{language=XML}
     222\begin{lstlisting}
     223  <dcif:conceptualDomain type="constrained">
     224     <dcif:dataType>string</dcif:dataType>
     225     <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
     226     <dcif:rule>[a-z]{3}</dcif:rule>
     227  </dcif:conceptualDomain>
     228  <dcif:conceptualDomain type="constrained">
     229     <dcif:dataType>string</dcif:dataType>
     230     <dcif:ruleType>CLAVAS vocabulary</dcif:ruleType>
     231      <dcif:rule>
     232         <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
     233      </dcif:rule>
     234  </dcif:conceptualDomain>
     235\end{lstlisting}
     236
     237That is, the new rule pointing to the vocabulary would be \emph{added}, so that tools that do not support CLAVAS lookup but are capable of XSD/RNG validation can still use the regular expression based definition.
     238
     239\begin{note}
     240Integrate:
     241
     242ISOcat refers to CLAVAS as a hint, the metadata schema is the final one that has the real CLAVAS vocabulary reference, i.e., no reference to CLAVAS via ISOcat.
     243\end{note}
     244
     245Note, though, that nothing stated in the DC specification is binding;
     246it is rather a generic hint or recommendation\todoin{check: it is not ``normative''}
     247(even if the DC is closed). The authoritative/normative information is in the schema.
     248A schema modeler (concept-)linking an element in the schema
     249to a DC can decide to apply a different restriction to the values allowed
     250in that element. The information from the DCR serves as a recommendation or default.
     251
     252
     253\begin{figure*}[!ht]
     254\begin{center}
     255\includegraphics[width=0.7\textwidth]{images/concept_linking.png}
     256\end{center}
     257\caption{The data flow and linking between schema, data categories and vocabularies}
     258\label{fig:concept_linking}
     259\end{figure*}
     260 
     261
     262\paragraph {Modelling the vocabulary reference in the schema}
     263It has yet to be defined how the information about the vocabulary can be translated into a valid schema representation.
     264One brute-force approach would be to explicitly enumerate all the values from the vocabulary. This is currently done
     265within the CMD-framework with the language codes\todocite{cmd-component ISO-639}. However, there is clearly a limit to this approach, both in terms of the size of the vocabulary (ISO-639 contains 7,679 items (language codes), adding some 2MB to each schema referencing it) and its stability/change rate: ISO-639 is a standard with a fixed list, whereas most other vocabularies are more volatile (think organizations).
     266
     267Most of these vocabularies also cannot be seen as closed-constrained: the list that is provided gives a recommended orthographic variant for a given entity while still allowing other values for the given field, rather than restricting the values to only the items from the vocabulary (think organizations).
     268
     269So this has to be solved in a ``soft'' way. Most schema languages allow the schema to be annotated.
     270This is already used with DCR, adding the \code{@dcr:datcat} attribute to schema elements.
     271CMDI (the ComponentRegistry, when generating schemas) also puts information in <xs:appinfo/>.
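
\noindent
As an illustration only, such a ``soft'' annotation could be sketched in XSD as follows. The \code{clavas} namespace URI, the element name \code{languageID} and the placement of \code{clavas:vocabulary} inside \code{xs:appinfo} are assumptions made for this sketch; they are not taken from an existing CMDI or CLAVAS specification. The regular expression remains the only constraint a plain XSD validator would enforce:

\lstset{language=XML}
\begin{lstlisting}
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dcr="http://www.isocat.org/ns/dcr"
           xmlns:clavas="http://example.org/ns/clavas">
  <!-- clavas namespace URI above is a placeholder (assumption) -->
  <xs:element name="languageID"
              dcr:datcat="http://www.isocat.org/datcat/DC-2482">
    <xs:annotation>
      <xs:appinfo>
        <!-- hint for tools; ignored by plain XSD validation -->
        <clavas:vocabulary
            href="http://my.openskos.org/vocab/ISO-639"
            type="closed"/>
      </xs:appinfo>
    </xs:annotation>
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="[a-z]{3}"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>
\end{lstlisting}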
     272
     273Tools like Arbil can get access to these annotations, e.g., a reference to a CLAVAS vocabulary, and act upon
     274it, i.e., use OpenSKOS's autocomplete API.
     275Normal XSD validation would then not check whether a value actually is part of the vocabulary. This
     276isn't a problem if the vocabulary is open, e.g., organisation names, but
     277it is when the value domain is closed, e.g., ISO 639-3. In the latter case
     278the XSD generation might have two modes: a lax (smaller) version which
     279doesn't contain the closed vocabulary as an enumeration and leaves it to
     280the tool, and a strict version which does contain the vocabulary as an
     281enumeration. Probably the latter should stay the default, but Arbil could
     282request the lax version leading to smaller and quicker XSD validation
     283inside the tool.
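
\noindent
The difference between the two modes could be sketched as two alternative type definitions. The type names \code{LanguageIDLax} and \code{LanguageIDStrict} are hypothetical, and only three language codes are shown for the strict variant; the sketch does not reflect how the ComponentRegistry actually generates schemas:

\lstset{language=XML}
\begin{lstlisting}
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- lax: pattern only; vocabulary membership is left to the tool -->
  <xs:simpleType name="LanguageIDLax">
    <xs:restriction base="xs:string">
      <xs:pattern value="[a-z]{3}"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- strict: the vocabulary is carried as an enumeration -->
  <xs:simpleType name="LanguageIDStrict">
    <xs:restriction base="xs:string">
      <xs:enumeration value="deu"/>
      <xs:enumeration value="eng"/>
      <xs:enumeration value="nld"/>
      <!-- remaining codes of the vocabulary omitted in this sketch -->
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
\end{lstlisting}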
     284
     285With this proposal, ISOcat constrained DCs can refer to a CLAVAS vocabulary as a way to constrain the value domain (we stretch this a bit if a vocabulary is `open', e.g., organization names, where it provides the preferred spelling of known organizations but it still has to be possible to add new organization names that are not in the vocabulary).
     286
     287In ISOcat such constraints have the same status as, for example, the data type: ISOcat just provides hints and has no way to enforce them. Look at CMDI, where the CMDI elements refer to an ISOcat DC via a concept link but may have a completely different data type. In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (where this is planned).
     288
     289\begin{note}
     290\noindent
     291something similar for the link to an EBNF grammar in SCHEMAcat:
     292
     293%\begin{lstlisting}
     294\begin{verbatim}
     295      <scr:valueSchema
     296               xmlns:scr="http://www.isocat.org/ns/scr"
     297               pid="http://hdl.handle.net/1839/00-SCHM-0000-0000-004A-A"
     298               type="ISO 14977:1996 EBNF"/>
     299\end{verbatim}
     300%\end{lstlisting}
     301\end{note}
     302
     303
     304Finally, the client application (e.g. a metadata editor) is configured/guided by the schema.
     305It can use the reference to the DC to fetch explanations (semantic information) and translations from ISOcat, but it is bound to the value range as restricted by the schema.
     306
     307\todoask{Could the application use the vocabulary indication in the DC-spec as default or fallback?}
     308
     309
     310
     311       
    180312\subsection{CMDI - Exploitation side}
    181 Metadata complying to the CMD-framework is being created by a growing number of institutions  by various means, automatic transformation from legacy data, authoring of new metadata records with the help of one of the Metadata-Editors (TODO: cite: Arbil, NALIDA, ). The CMD-Infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints.  These are being harvested daily by a dedicated CLARIN harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}. The harvested data is validated against the schemas \todo{What about Normalization?}.  and made available in packaged datasets. These are being fetched by the exploitations side components, that index the metadata records and make them available for searching and browsing.
     313Metadata complying with the CMD-framework is being created by a growing number of institutions by various means: automatic transformation from legacy data, or authoring of new metadata records with the help of one of the metadata editors (TODO: cite: Arbil, NALIDA, ). The CMD-Infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints. These are harvested daily by a dedicated CLARIN harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}. The harvested data is validated against the schemas\todoin{What about Normalization?} and made available in packaged datasets. These are fetched by the exploitation side components, which index the metadata records and make them available for searching and browsing.
    182314
    183315\begin{figure*}[!ht]
     
    189321The first stable and publicly available application providing access to the collected metadata of CMDI has been the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, developed within the CLARIN project. This application operates on the same collection of data as is discussed in this work; however, it employs a faceted search, manually mapping the appropriate metadata fields from the different schemas to 10? fixed facets. The underlying search engine is the widely used full-text search engine Apache Solr\footnote{\url{http://lucene.apache.org/solr/}}. Although this is a very reductionist approach, it is certainly a great starting point, offering a core set of categories together with an initial set of category mappings.
    190322
    191 More recently, the team at Meertens Institute developed a similar application the \texttt{MI Search Engine}\footnote{\url{http://www.meertens.knaw.nl/cmdi/search/}}. It too is based on the Apache Solr and provides a faceted search, but with a substantially more sophisticated both indexing process and search interface. \todo { describe indexing and search}
    192 \todo { add citation}
     323More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\footnote{\url{http://www.meertens.knaw.nl/cmdi/search/}}. It too is based on Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface. \todoin { describe indexing and search}
     324\todocite {MI Search Engine}
    193325
    194326And finally, there is the \emph{Metadata Repository} aimed to collect all the harvested metadata descriptions from CLARIN centers,
     
    200332
    201333The requirements for these repositories: PIDs, CMD, OAI-PMH
    202 \todo{cite: center-B paper}
     334\todocite{center-B paper}
    203335
    204336\section{Distributed system - federated search}