Infrastructure.tex

Timestamp:

03/15/13 21:44:23 (11 years ago)

Author:

vronk

Message:

adding figures, big reorganization of the content structure
some new text on Controlled Vocabularies, Underlying Infrastructure.

File:

: 1 edited

SMC4LRT/chapters/Infrastructure.tex (modified) (8 diffs)

SMC4LRT/chapters/Infrastructure.tex

-                      r2696
+                      r2703
 \begin{itemize}
 \item Data Category Registry,
+\item Data Category Registry
 \item Relation Registry
+\item Schema Registry
 \item Component Registry
 \item Vocabulary Alignment Service (OpenSKOS)
 …
 !Cf. Erhard Hinrichs 2009
+\todoin{Describe SCHEMAcat}
 \noindent
 All these components are running services, that this work shall directly build upon.
 …
 \subsubsection{Vocabulary Service - CLAVAS}
 As described in previous section (\ref{dcr}), a solid pilar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is – by design – not usable for all kinds of reference data. In general, it suits well for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow does not lend itself well to maintain “semi-closed” concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added).
+As described in the previous section (\ref{dcr}), a solid pillar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is -- by design -- not usable for all kinds of reference data. In general, it is well suited for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow do not lend themselves well to maintaining ``semi-closed'' concept domains, i.e. controlled vocabularies such as lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not every string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed to be changing (especially through new entities being added).
 This shortcoming leads to a need for an additional registry/repository service for this kind of data (controlled vocabularies). Within the CLARIN project mainly the abovementioned taskforce \emph{CLAVAS} is concerned with this challenge.
 …
 The following are to be handled in the short term, in order of urgency/relevance/priority:
 \begin{itemize}
 \item the list of language codes\todo{url: ISO-639}
+\item the list of language codes\todoin{url: ISO-639}
 \item country codes
 \item organization names for the domain of language resources
 …
 See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies
 and \ref{dcr-skos} for discussion on mapping the information about data categories from ISOcat to \xne{SKOS}.
+and \ref{interaction-dcr-skos} for discussion on mapping the information about data categories from ISOcat to \xne{SKOS}.
 \subsection{Interaction between DCR, VAS and client applications}
+In my view you do that in ISOcat by binding the constrained DC to the
+CLAVAS vocabulary, e.g., the constrained domain of /language ID/ (DC-2482)
+could look as follows:
+I think there is no need to express the relationship between this constrained DC
+and the vocabulary in CLAVAS itself. Many DCs (or any other application
+using CLAVAS) can refer to the same CLAVAS vocabulary.
+See above for my reasoning. I don't think this information needs to be in
+CLAVAS.
+I do think that ISOcat, CLAVAS, RELcat, an actual language
+resource all provide a part of the semantic network.
+And if you can express these all in RDF, which we can for almost all of them (maybe
+except the actual language resource ... unless it has a schema adorned
+with ISOcat DC references ... < insert a SCHEMAcat plug ;-) >, but for
+metadata we have that in the CMDI profiles ...) you could load all the
+relevant parts in a triple store and do your SPARQL/reasoning on it. Well
+that's where I'm ultimately heading with all these registries related to
+semantic interoperability ... I hope ;-)
+Maybe I should add to this that I clearly see ISOcat as a user of CLAVAS,
+i.e., for constrained DCs.
+However, ISOcat as a provider of vocabularies
+is less clear to me. Many of the value domains are small and CLAVAS is
+overkill.
+\label{interaction-dcr-skos}
+The DCR recognizes the following types of data categories (Figure \ref{fig:dc_type}):
+simple, complex: closed, open, constrained, (container)?
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=0.7\textwidth]{images/dc_types}
+\end{center}
+\caption{Data Category types}
+\label{fig:dc_type}
+\end{figure*}
+\todocite{DC types - ISOcat introduction at CLARIN-NL Workshop}
+See Figure \ref{fig:DCR_data_model} for the full DCR data model.
+\subsubsection{Export DCR to SKOS}
+\todocite{Menzo2013-03-12 mail}
+The semantic proximity of a /data category/ to a /concept/ may tempt one into
+a na\"{\i}ve approach to mapping DCR to SKOS, namely mapping every data category (from one profile) to a concept,
+all of them belonging to the \xne{ISOcat-profile:ConceptScheme}.
+However, this is neither practical nor useful: ISOcat as a whole is too disparate, and so would be the resulting vocabulary.
+A more sensible approach is to export only closed DCs as separate ConceptSchemes and their respective simple DCs as Concepts within that scheme.
+The rationale is that, if we see a vocabulary as a set of possible values for a
+field/element/attribute, complex DCs in ISOcat are the users of such
+vocabularies and simple DCs are the DCR equivalent of the values in such a
+vocabulary.\todocite{Menzo}
+Another aspect is that a simple DC can be in the value domains of multiple closed DCs.
+Likewise, a skos:Concept can belong to multiple ConceptSchemes\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
+So there could be a 1:1 mapping of [complex closed DCs] to [skos:ConceptSchemes] and of [simple DCs] to [skos:Concepts].
+That would automatically also convey the possibly multiple membership of simple DCs / skos:Concepts in closed DCs / skos:ConceptSchemes.
+Alternatively, for each value domain a SKOS concept scheme with SKOS concepts can be created,
+i.e., a SKOS concept always belongs to one concept scheme, but multiple SKOS concepts refer to the same simple DC using <dcr:datcat/> (and <dcterms:source/>).
+This is how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest/representations/dcs2/clavas.xsl}
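+A minimal sketch of what the export of two value domains could look like in SKOS (RDF/XML) under this second approach. It is an illustration under stated assumptions, not the output of the actual export: the scheme and concept URIs under \code{example.org}, the labels and the identifier \code{DC-0000} are placeholders, and both the \code{dcr} namespace URI and the use of \code{rdf:resource} on <dcr:datcat/> and <dcterms:source/> are assumed rather than taken from the stylesheet.
+\lstset{language=XML}
+\begin{lstlisting}
+<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
+         xmlns:dcterms="http://purl.org/dc/terms/"
+         xmlns:dcr="http://www.isocat.org/ns/dcr.rdf#"><!-- assumed namespace -->
+  <!-- one concept scheme per value domain (placeholder URIs) -->
+  <skos:ConceptScheme rdf:about="http://example.org/clavas/domain-A"/>
+  <skos:ConceptScheme rdf:about="http://example.org/clavas/domain-B"/>
+  <!-- each concept belongs to exactly one scheme ... -->
+  <skos:Concept rdf:about="http://example.org/clavas/domain-A/value-1">
+    <skos:inScheme rdf:resource="http://example.org/clavas/domain-A"/>
+    <skos:prefLabel xml:lang="en">value 1</skos:prefLabel>
+    <dcr:datcat rdf:resource="http://www.isocat.org/datcat/DC-0000"/>
+    <dcterms:source rdf:resource="http://www.isocat.org/datcat/DC-0000"/>
+  </skos:Concept>
+  <!-- ... but a concept in another scheme may refer to the same simple DC -->
+  <skos:Concept rdf:about="http://example.org/clavas/domain-B/value-1">
+    <skos:inScheme rdf:resource="http://example.org/clavas/domain-B"/>
+    <skos:prefLabel xml:lang="en">value 1</skos:prefLabel>
+    <dcr:datcat rdf:resource="http://www.isocat.org/datcat/DC-0000"/>
+    <dcterms:source rdf:resource="http://www.isocat.org/datcat/DC-0000"/>
+  </skos:Concept>
+</rdf:RDF>
+\end{lstlisting}
+The shared simple DC is thus recoverable from the \code{dcr:datcat} reference, even though the two concepts are distinct resources.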
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=0.6\textwidth]{images/export_DCR2SKOS.png}
+\end{center}
+\caption{The data flow and linking between schema, data categories and vocabularies}
+\label{fig:export_dcr2skos}
+\end{figure*}
+Open or constrained DCs are not exported as they don't provide anything to a vocabulary.
+There is no need to express the relationship between this constrained DC
+and the vocabulary in CLAVAS itself.
+Indeed it is not possible to express the conceptualDomain/range of a data category within SKOS.
+However, they can refer to a CLAVAS vocabulary. Indeed, providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository.
+However, it still needs to be assessed how useful this approach is. In the metadata profile
+there are many closed DCs with small value domains. How useful are those
+in CLAVAS?
+Originally, the vocabulary repository was conceived to manage rather large and complex value domains that do not fit easily into the DCR data model.
 Where the value domains are big (ISO 639-3) or can only be
 partially enumerated (organization names) ISOcat can't/shouldn't contain
 …
 providers, e.g., /linguistic subject/ (DC-2527/), and still also need to
 stay in ISOcat. I think at some point we should create a smaller set of
+metadata DCs to be harvested by CLAVAS. Hennie and I discussed this also
+somewhere last year ... I'll be a the Meertens on Thursday, maybe we can
+talk it over once more.
+>>
+I guess the discussion is about two different things:
+- how to specify that the range of some metadata property consists of Concepts from a specific ConceptScheme
+-> this can not be done in SKOS, but external schema definitions could refer to the URI of some (CLAVAS/OpenSKOS) ConceptScheme
+- how to specify relations between Concepts that are in different ConceptSchemes
+-> this can be done in SKOS using skos:exactMatch, skos:closeMatch, skos:broadMatch, skos:narrowMatch, skos:relatedMatch. OpenSKOS supports adding and searching these properties already, and the OpenSKOS editor also already has support for it.
+> - define them in a new clavas namespace and add the properties as a specialization to OpenSKOS, you consider them part of the vocabulary definition then
+> --> is a bit against the OpenSKOS 'philosophy' that OpenSKOS is a platform for SKOS, by definition.
+> - add them to your metadata schema or profile, you consider them as constraints on vocabulary usage for a given metadata field
+> --> this would be my preference
+> - add them to a definition in ISOcat, and let your metadata schema refer to ISOcat instead of OpenSKOS. ISOcat extends the OpenSKOS definition then.
+> --> leads to mixing of ISOcat and OpenSKOS, in semantic and technical ways. Not my preference.
+In what I propose ISOcat constrained DCs can refer to a CLAVAS vocabulary as a way to constrain (we stretch this a bit if a vocabulary is 'open', e.g., like organization names where it provides the preferred spelling of known organizations but you still have to be able to add new organization names). In ISOcat such constraints have the same status as, for example, the data type, which is that ISOcat just provides hints; it has no way to enforce this. Look at CMDI where the CMDI elements refer to an ISOcat DC via a concept link but they may have a completely different data type. In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (well, it's in the planning).
+So although ISOcat refers to CLAVAS as a hint, the metadata schema is the final one that has the real CLAVAS vocabulary reference, i.e., no reference to CLAVAS via ISOcat. Hennie, I think that still meets your preference and prevents unwanted mixing.
+metadata DCs to be harvested by CLAVAS.
+Therefore a threshold seems sensible, where only value domains with more
+than 20, 50 or 100 values are exported.
+\subsubsection{Vocabulary linking and use}
+Currently (before the integration of VAS and DCR), the only way to constrain the value domain of a data category
+is through the means that XML Schema provides \todoin{check xml schema possibilities to restrict values}, such as a regular expression. So for the data category \concept{languageID DC-2482}
+the rule looks as follows:
+\lstset{language=XML}
+\begin{lstlisting}
+        <dcif:conceptualDomain type="constrained">
+                <dcif:dataType>string</dcif:dataType>
+                <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
+                <dcif:rule>[a-z]{3}</dcif:rule>
+        </dcif:conceptualDomain>
+\end{lstlisting}
+A current proposal by Windhouwer\todocite{Menzo2013-03-12 mail} for integration with CLAVAS foresees the following extension:
+\begin{lstlisting}
+        <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
+\end{lstlisting}
+\code{@href} points to the vocabulary. Actually a PID should be used in the context
+of ISOcat, but it is not clear how persistent the vocabularies are. This may pose a problem, as part of the DC specification may now have a different persistency than the core.
+\code{@type} could be \code{closed} or \code{open}. \code{closed}: only values in the vocabulary are
+valid. \code{open}: the values in the vocabulary are hints/preferred values. Basically the DC itself is then open.
+This would yield a definition of the conceptualDomain for the data category as follows:
+\lstset{language=XML}
+\begin{lstlisting}
+  <dcif:conceptualDomain type="constrained">
+     <dcif:dataType>string</dcif:dataType>
+     <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
+     <dcif:rule>[a-z]{3}</dcif:rule>
+  </dcif:conceptualDomain>
+  <dcif:conceptualDomain type="constrained">
+     <dcif:dataType>string</dcif:dataType>
+     <dcif:ruleType>CLAVAS vocabulary</dcif:ruleType>
+      <dcif:rule>
+         <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
+      </dcif:rule>
+  </dcif:conceptualDomain>
+\end{lstlisting}
+I.e., the new rule pointing to the vocabulary would be \emph{added}, so that tools that do not support CLAVAS lookup but are capable of XSD/RNG validation can still use the regular-expression-based definition.
+\begin{note}
+Integrate:
+ISOcat refers to CLAVAS as a hint, the metadata schema is the final one that has the real CLAVAS vocabulary reference, i.e., no reference to CLAVAS via ISOcat.
+\end{note}
+Note though, that anything stated in the DC specification is not binding,
+but rather a generic hint or recommendation, \todoin{check: it is not ``normative''}.
+(Even if the DC is closed.) The authoritative/normative information is in the schema.
+A schema modeler, (concept)linking an element in the schema
+to a DC can decide to have another restriction for the values allowed
+in that element. The information from DCR serves as recommendation or default.
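+Such a schema-local override could be sketched in XSD as follows. This is only an illustration: the element name \code{LanguageCode} and the pattern \code{[a-z]\{2,3\}} are hypothetical, chosen to show a schema that is concept-linked to \concept{languageID DC-2482} yet deviates from the three-letter rule recommended in the DC specification.
+\lstset{language=XML}
+\begin{lstlisting}
+<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
+           xmlns:dcr="http://www.isocat.org/ns/dcr">
+  <!-- hypothetical element, concept-linked to languageID (DC-2482) -->
+  <xs:element name="LanguageCode"
+              dcr:datcat="http://www.isocat.org/datcat/DC-2482">
+    <xs:simpleType>
+      <xs:restriction base="xs:string">
+        <!-- schema-local rule; the DC only recommends [a-z]{3} -->
+        <xs:pattern value="[a-z]{2,3}"/>
+      </xs:restriction>
+    </xs:simpleType>
+  </xs:element>
+</xs:schema>
+\end{lstlisting}
+A validator only sees the pattern given in the schema; the \code{@dcr:datcat} annotation carries the semantics but imposes no restriction of its own.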
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=0.7\textwidth]{images/concept_linking.png}
+\end{center}
+\caption{The data flow and linking between schema, data categories and vocabularies}
+\label{fig:concept_linking}
+\end{figure*}
+\paragraph {Modelling the vocabulary reference in the schema}
+It still needs to be defined how the information about the vocabulary can be translated into a valid schema representation.
+One brute-force approach would be to explicitly enumerate all the values from the vocabulary. This is currently being done
+within the CMD-framework with the language codes\todocite{cmd-component ISO-639}. However, there is clearly a limit to this approach, both in terms of the size of the vocabulary (ISO-639 contains 7,679 items (language codes), adding some 2MB to each schema referencing it) and its stability/change rate: ISO-639 is a standard with a fixed list, whereas most other vocabularies are more volatile (think organizations).
+Most of these vocabularies also cannot be seen as closed-constrained, i.e. the list provides a recommended orthographic variant for a given entity while still allowing other values for the given field, rather than restricting the values to only the items from the vocabulary (think organizations).
+So this has to be solved in a ``soft'' way. Most schema languages allow annotating the schema.
+This is already used with DCR, adding the \code{@dcr:datcat} into schema elements.
+Also CMDI (ComponentRegistry when generating schemas) puts information in <xs:appinfo/>.
+Tools like Arbil can get access to these annotations, e.g., a reference to a CLAVAS vocabulary, and act upon
+it, i.e., use OpenSKOS's autocomplete API.
+Normal XSD validation would then not check whether a value actually is part of the vocabulary. This
+isn't a problem if the vocabulary is open, e.g., organisation names, but
+it is when the value domain is closed, e.g., ISO 639-3. In the latter case
+the XSD generation might have two modes: a lax (smaller) version which
+doesn't contain the closed vocabulary as an enumeration and leaves it to
+the tool, and a strict version which does contain the vocabulary as an
+enumeration. Probably the latter should stay the default, but Arbil could
+request the lax version leading to smaller and quicker XSD validation
+inside the tool.
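+The two modes could be sketched as follows. Both listings are assumptions made for illustration: the element name \code{LanguageCode} and the \code{clavas} namespace URI are hypothetical, and placing the <clavas:vocabulary/> element inside <xs:appinfo/> is one possible encoding, not an agreed format. The lax variant keeps only the regular expression and the annotation, leaving the vocabulary check to the tool:
+\lstset{language=XML}
+\begin{lstlisting}
+<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
+           xmlns:clavas="http://example.org/ns/clavas"><!-- hypothetical -->
+  <xs:element name="LanguageCode">
+    <xs:annotation>
+      <xs:appinfo>
+        <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639"
+                           type="closed"/>
+      </xs:appinfo>
+    </xs:annotation>
+    <xs:simpleType>
+      <xs:restriction base="xs:string">
+        <xs:pattern value="[a-z]{3}"/>
+      </xs:restriction>
+    </xs:simpleType>
+  </xs:element>
+</xs:schema>
+\end{lstlisting}
+The strict variant carries the same annotation but additionally enumerates the vocabulary, so that plain XSD validation rejects unknown codes:
+\begin{lstlisting}
+<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
+           xmlns:clavas="http://example.org/ns/clavas"><!-- hypothetical -->
+  <xs:element name="LanguageCode">
+    <xs:annotation>
+      <xs:appinfo>
+        <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639"
+                           type="closed"/>
+      </xs:appinfo>
+    </xs:annotation>
+    <xs:simpleType>
+      <xs:restriction base="xs:string">
+        <xs:enumeration value="aar"/>
+        <xs:enumeration value="abk"/>
+        <!-- remaining language codes omitted in this sketch -->
+      </xs:restriction>
+    </xs:simpleType>
+  </xs:element>
+</xs:schema>
+\end{lstlisting}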
+With this proposal, ISOcat constrained DCs can refer to a CLAVAS vocabulary as a way to constrain (we stretch this a bit if a vocabulary is 'open', e.g., organization names, where it provides the preferred spelling of known organizations but it still has to be possible to add new organization names that are not in the vocabulary).
+ In ISOcat such constraints have the same status as, for example, the data type, which is that ISOcat just provides hints; it has no way to enforce this. Look at CMDI, where the CMDI elements refer to an ISOcat DC via a concept link but may have a completely different data type. In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (well, it's in the planning).
+\begin{note}
+\noindent
+something similar for the link to an EBNF grammar in SCHEMAcat:
+%\begin{lstlisting}
+\begin{verbatim}
+      <scr:valueSchema
+               xmlns:scr="http://www.isocat.org/ns/scr"
+               pid="http://hdl.handle.net/1839/00-SCHM-0000-0000-004A-A"
+               type="ISO 14977:1996 EBNF"/>
+\end{verbatim}
+%\end{lstlisting}
+\end{note}
+Finally, the client application (e.g. a metadata editor) is configured/guided by the schema.
+It can use the reference to the DC to fetch explanations (semantic information)  (and translations) from ISOcat, but it is bound to the value range as restricted by the schema.
+\todoask{ Could the application use the vocabulary indication in DC-spec as default or fallback?}
 \subsection{CMDI - Exploitation side}
 Metadata complying to the CMD-framework is being created by a growing number of institutions  by various means, automatic transformation from legacy data, authoring of new metadata records with the help of one of the Metadata-Editors (TODO: cite: Arbil, NALIDA, ). The CMD-Infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints.  These are being harvested daily by a dedicated CLARIN harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}. The harvested data is validated against the schemas \todo{What about Normalization?}.  and made available in packaged datasets. These are being fetched by the exploitations side components, that index the metadata records and make them available for searching and browsing.
+Metadata complying with the CMD-framework is being created by a growing number of institutions by various means: automatic transformation from legacy data, or authoring of new metadata records with the help of one of the Metadata-Editors (TODO: cite: Arbil, NALIDA, ). The CMD-Infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints. These are being harvested daily by a dedicated CLARIN harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}. The harvested data is validated against the schemas \todoin{What about Normalization?} and made available in packaged datasets. These are being fetched by the exploitation-side components, which index the metadata records and make them available for searching and browsing.
 \begin{figure*}[!ht]
 …
 The first stable and publicly available application providing access to the collected metadata of CMDI has been the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, being developed within the CLARIN project. This application operates on the same collection of data as is discussed in this work, however it employs a faceted search, mapping manually the appropriate metadata fields from the different schemas to 10? fixed facets. Underlying search engine is the widely used full-text search engine Apache Solr\footnote{\url{http://lucene.apache.org/solr/}}. Although this is a very reductionist approach it is certainly a great starting point offering a core set of categories together with an initial set of category mappings.
 More recently, the team at Meertens Institute developed a similar application the \texttt{MI Search Engine}\footnote{\url{http://www.meertens.knaw.nl/cmdi/search/}}. It too is based on the Apache Solr and provides a faceted search, but with a substantially more sophisticated both indexing process and search interface. \todo { describe indexing and search}
 \todo { add citation}
+More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\footnote{\url{http://www.meertens.knaw.nl/cmdi/search/}}. It too is based on Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface. \todoin { describe indexing and search}
+\todocite {MI Search Engine}
 And finally, there is the \emph{Metadata Repository} aimed to collect all the harvested metadata descriptions from CLARIN centers,
 …
 The requirements for these repositories: PIDs, CMD, OAI-PMH
 \todo{cite: center-B paper}
+\todocite{center-B paper}
 \section{Distributed system - federated search}
