Changeset 2703


Timestamp:
03/15/13 21:44:23 (11 years ago)
Author:
vronk
Message:

adding figures, big reorganization of the content structure
some new text on Controlled Vocabularies, Underlying Infrastructure.

Location:
SMC4LRT
Files:
13 added
10 edited

  • SMC4LRT/Outline.tex

    r2695 r2703  
    2222%\svnid{$Id$}
    2323
     24\usepackage{titlesec}
     25\titlespacing*{\chapter}{0pt}{0.5in}{0.5in}
    2426
    2527%%% Examples of Article customizations
     
    3133\geometry{a4paper} % or letterpaper (US) or a5paper or....
    3234%\geometry{margin=1cm} % for example, change the margins to 1 cm all round
    33 \topmargin=-0.5in
     35\topmargin=-0.6in
    3436\textheight=700pt
    3537% \geometry{landscape} % set up the page for landscape
     
    5052%%% HEADERS & FOOTERS
    5153\usepackage{fancyhdr} % This should be set AFTER setting up the page geometry
    52 \pagestyle{plain} % options: empty , plain , fancy
     54\pagestyle{empty} % options: empty , plain , fancy
    5355\renewcommand{\headrulewidth}{0pt} % customise the layout...
    5456\lhead{}\chead{}\rhead{}
     
    6971
    7072
    71 \input{utils.tex}
     73\input{utils}
    7274
    7375%%% END Article customizations
     
    8284\begin{document}
    8385\maketitle
    84 
    85 \tableofcontents*
     86\newgeometry{top=0.8in,bottom=1in}
     87%\addtocontents{toc}{\protect\enlargethispage{35mm}}
     88\tableofcontents
     89\restoregeometry
    8690
    8791\listoffigures
     
    112116
    113117
    114 \section{Questions, Remarks}
    115 
    116 \begin{itemize}
    117 \item How does this relate to federated search?
    118 \item ontological vs. semasiological (semantic features: categorial/archisemes, differentiating, specifying)
    119 \end{itemize}
    120 
    121118
    122119\bibliographystyle{ieeetr}
    123120\bibliography{../../2bib/lingua,../../2bib/ontolingua,../../2bib/smc4lrt,../../2bib/semweb}
    124121
     122\appendix
     123
     124\input{chapters/appendix}
     125
    125126
    126127\end{document}
  • SMC4LRT/chapters/Data.tex

    r2697 r2703  
    66
    77\section{Metadata Formats}
     8
    89
    910\subsection{CMD-Framework}
     
    3031\end{center}
    3132
     33\todoin{Collect numbers about the CMD-Framework (profiles, datcats) + historical development}
     34
     35\todoin{Collect numbers about CMD records (collections, used profiles, ...) in historical perspective}
     36
    3237
    3338\subsection{Dublin Core + OLAC}
    3439
    3540DC, OLAC
     41
     42DublinCore Resource Types\furl{http://dublincore.org/documents/resource-typelist/}
    3643
    3744\subsection{TEI / teiHeader}
     
    113120\section{LRT Metadata Catalogs/Collections}
    114121
    115 \todo{[DFKI/LT-World]  - collection or ontology}
     122\todoin{Overview of catalogs: name, since, \#providers, \#resources}
     123
     124\todoin{[DFKI/LT-World]  - collection or ontology}
    116125
    117126\subsection{CMDI}
  • SMC4LRT/chapters/Infrastructure.tex

    r2696 r2703  
    1818
    1919\begin{itemize}
    20 \item Data Category Registry,
     20\item Data Category Registry
    2121\item Relation Registry
     22\item Schema Registry
    2223\item Component Registry
    2324\item Vocabulary Alignment Service (OpenSKOS)
     
    5960!Cf. Erhard Hinrichs 2009
    6061
     62\todoin{Describe SCHEMAcat}
     63
    6164\noindent
    6265All these components are running services that this work shall build upon directly.
     
    8992
    9093\subsubsection{Vocabulary Service - CLAVAS}
    91 As described in the previous section (\ref{dcr}), a solid pillar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is – by design – not usable for all kinds of reference data. In general, it is well suited for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow do not lend themselves well to maintaining “semi-closed” concept domains, i.e. controlled vocabularies such as lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not every string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed to be changing (especially through new entities being added).
     94As described in the previous section (\ref{dcr}), a solid pillar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is – by design – not usable for all kinds of reference data. In general, it is well suited for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow do not lend themselves well to maintaining ``semi-closed'' concept domains, i.e. controlled vocabularies such as lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not every string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed to be changing (especially through new entities being added).
    9295
    9396This shortcoming leads to a need for an additional registry/repository service for this kind of data (controlled vocabularies). Within the CLARIN project, it is mainly the above-mentioned task force \emph{CLAVAS} that is concerned with this challenge.
     
    104107The following are to be handled in the short term, in order of urgency/relevance/priority:
    105108\begin{itemize}
    106 \item the list of language codes\todo{url: ISO-639}
     109\item the list of language codes\todoin{url: ISO-639}
    107110\item country codes
    108111\item organization names for the domain of language resources
     
    110113
    111114See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies
    112 and \ref{dcr-skos} for discussion on mapping the information about data categories from ISOcat to \xne{SKOS}.
     115and \ref{interaction-dcr-skos} for discussion on mapping the information about data categories from ISOcat to \xne{SKOS}.
    113116
    114117\subsection{Interaction between DCR, VAS and client applications}
    115 
    116 
    117 In my view you do that in ISOcat by binding the constrained DC to the
    118 CLAVAS vocabulary, e.g., the constrained domain of /language ID/ (DC-2482)
    119 could look as follows:
    120 
    121 I think is no need to express the relationship between this constrained DC
    122 and the vocabulary in CLAVAS itself. Many DCs (or any other application
    123 using CLAVAS) can refer to the same CLAVAS vocabulary.
    124 
    125 
    126 See above for my reasoning. I don't think this information needs to be in
    127 CLAVAS.
    128 I do think that ISOcat, CLAVAS, RELcat, an actual language
    129 resource all provide a part of the semantic network.
    130 
    131 And if you can express these all in RDF, which we can for almost all of them (maybe
    132 except the actual language resource ... unless it has a schema adorned
    133 with ISOcat DC references ... < insert a SCHEMAcat plug ;-) >, but for
    134 metadata we have that in the CMDI profiles ...) you could load all the
    135 relevant parts in a triple store and do your SPARQL/reasoning on it. Well
    136 that's where I'm ultimately heading with all these registries related to
    137 semantic interoperability ... I hope ;-)
    138 
    139 
    140 Maybe I should add to this that I clearly see ISOcat as an user of CLAVAS,
    141 i.e., for constrained DCs.
    142 
    143 However, ISOcat as a provider of vocabularies
    144 is less clear to me. Many of the value domains are small and CLAVAS is
    145 overkill.
    146 
     118\label{interaction-dcr-skos}
     119
     120
     121The DCR recognizes the following types of data categories (Figure \ref{fig:dc_type}):
     122simple, complex: closed, open, constrained, (container)?
     123
     124\begin{figure*}[!ht]
     125\begin{center}
     126\includegraphics[width=0.7\textwidth]{images/dc_types}
     127\end{center}
     128\caption{Data Category types}
     129\label{fig:dc_type}
     130\end{figure*}
     131\todocite{DC types - ISOcat introduction at CLARIN-NL Workshop}
     132
     133See Figure \ref{fig:DCR_data_model} for the full DCR data model.
     134
     135\subsubsection{Export DCR to SKOS}
     136\todocite{Menzo2013-03-12 mail}
     137
     138
     139The semantic proximity of a /data category/ to a /concept/ may tempt one into
     140a na\"{\i}ve approach to mapping DCR to SKOS, namely mapping every data category (from one profile) to a concept,
     141all of them belonging to the \xne{ISOcat-profile:ConceptScheme}.
     142However, this is not practical: ISOcat as a whole is too disparate, and so would be the resulting vocabulary.
     143
     144A more sensible approach is to export only closed DCs as separate ConceptSchemes and their respective simple DCs as Concepts within that scheme.
     145The rationale is that, if we see a vocabulary as a set of possible values for a
     146field/element/attribute, complex DCs in ISOcat are the users of such
     147vocabularies and simple DCs are the DCR equivalents of the values in such a
     148vocabulary.\todocite{Menzo}
     149
     150Another aspect is that a simple DC can be in the value domains of multiple closed DCs.
     151Likewise, a skos:Concept can belong to multiple ConceptSchemes\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
     152So there could be a 1:1 mapping of [complex closed DCs] to [skos:ConceptSchemes] and of [simple DCs] to [skos:Concepts].
     153That would automatically also convey the possible multiple membership of simple DCs / skos:Concepts in closed DCs / skos:ConceptSchemes.
     154
     155Alternatively, for each value domain a SKOS concept scheme with SKOS concepts can be created,
     156i.e., a SKOS concept always belongs to one concept scheme, but multiple SKOS concepts refer to the same simple DC using \code{<dcr:datcat/>} (and \code{<dcterms:source/>}).
     157This is how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest
     158/representations/dcs2/clavas.xsl}
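
As a rough illustration of this second variant, two exported values that go back to the same simple DC could look like the following RDF/XML fragment. This is only a sketch under assumptions: the \code{example.org} URIs, the labels and the namespace bound to the \code{dcr} prefix are placeholders, and the actual output of \code{clavas.xsl} may be structured differently.

\lstset{language=XML}
\begin{lstlisting}
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:dcr="http://example.org/ns/dcr#">
  <!-- one concept scheme per value domain of a closed DC (placeholder URIs) -->
  <skos:ConceptScheme rdf:about="http://example.org/scheme/closed-DC-A"/>
  <skos:ConceptScheme rdf:about="http://example.org/scheme/closed-DC-B"/>

  <!-- each concept belongs to exactly one scheme ... -->
  <skos:Concept rdf:about="http://example.org/scheme/closed-DC-A/value-1">
    <skos:inScheme rdf:resource="http://example.org/scheme/closed-DC-A"/>
    <skos:prefLabel xml:lang="en">value 1</skos:prefLabel>
    <!-- ... and points back to the simple DC it was derived from -->
    <dcr:datcat rdf:resource="http://example.org/datcat/simple-DC-1"/>
    <dcterms:source rdf:resource="http://example.org/datcat/simple-DC-1"/>
  </skos:Concept>

  <!-- a second concept in another scheme, derived from the same simple DC -->
  <skos:Concept rdf:about="http://example.org/scheme/closed-DC-B/value-1">
    <skos:inScheme rdf:resource="http://example.org/scheme/closed-DC-B"/>
    <skos:prefLabel xml:lang="en">value 1</skos:prefLabel>
    <dcr:datcat rdf:resource="http://example.org/datcat/simple-DC-1"/>
    <dcterms:source rdf:resource="http://example.org/datcat/simple-DC-1"/>
  </skos:Concept>
</rdf:RDF>
\end{lstlisting}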
     159
     160
     161
     162\begin{figure*}[!ht]
     163\begin{center}
     164\includegraphics[width=0.6\textwidth]{images/export_DCR2SKOS.png}
     165\end{center}
     166\caption{The data flow and linking between schema, data categories and vocabularies}
     167\label{fig:export_dcr2skos}
     168\end{figure*}
     169 
     170Open or constrained DCs are not exported, as they do not contribute anything to a vocabulary.
     171There is no need to express the relationship between a constrained DC
     172and a vocabulary in CLAVAS itself.
     173Indeed, it is not possible to express the conceptualDomain/range of a data category within SKOS.
     174
     175However, they can refer to a CLAVAS vocabulary. Indeed, providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository.
     176
     177It still needs to be assessed, however, how useful this approach is. In the metadata profile
     178there are many closed DCs with small value domains. How useful are those
     179in CLAVAS?
     180
     181Originally, the vocabulary repository was conceived to manage rather large and complex value domains that do not fit easily into the DCR data model.
    147182Where the value domains are big (ISO 639-3) or can only be
    148183partially enumerated (organization names) ISOcat can't/shouldn't contain
     
    152187providers, e.g., /linguistic subject/ (DC-2527/), and still also need to
    153188stay in ISOcat. I think at some point we should create a smaller set of
    154 metadata DCs to be harvested by CLAVAS. Hennie and I discussed this also
    155 somewhere last year ... I'll be a the Meertens on Thursday, maybe we can
    156 talk it over once more.
    157 
    158 
    159 >>
    160 
    161 I guess the discussion is about two different things:
    162 - how to specify that the range of some metadata property consists of Concepts from a specific ConceptScheme
    163 -> this can not be done in SKOS, but external schema definitions could refer to the URI of some (CLAVAS/OpenSKOS) ConceptScheme
    164 - how to specifiy relations between Concepts that are in different ConceptSchemes
    165 -> this can be done in SKOS using skos: exactMatch, closeMatch, broaderMatch, narrowerMatch, relatedMatch. OpenSKOS supports adding and searching these properties already, and the OpenSKOS editor also already has support for it.
    166 
    167 > - define them in a new clavas namespace and add the properties as a specialization to OpenSKOS, you consider them part of the vocabulary definition then
    168 > --> is a bit against the OpenSKOS 'philosophy' that OpenSKOS is a platform for SKOS, by definition.
    169 > - add them to your metadata schema or profile, your consider them as constraints on vocabulary usage for a given metadata field
    170 > --> this would be my preference
    171 > - add them to a definition in ISOcat, and let your metadata schema refer to ISOcat instead of OpenSKOS. ISOcat extends the OpenSKOS definition then.
    172 > --> leads to mixing of ISOcat and OpenSKOS, in semantic and technical ways. Not my preference.
    173 
    174 In what I propose ISOcat constrained DCs can refer to a CLAVAS vocabulary as a way to constrain (we stretch this a bit if a vocabulary is 'open', e.g., like organization names where it provides the preferred spelling of known organizations but you still have to be able to add new organization names). In ISOcat such constraints have the same status as, for example, the data type, which is that ISOcat just provides hints it has no way to enforce this. Look at CMDI where the CMDI elements refer to a ISOcat DC via a concept link but they may have a completely different data type. In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (well its in the planning).
    175 
    176 So although ISOcat refers to CLAVAS as a hint, the metadata schema is the final one that has the real CLAVAS vocabulary reference, i.e., no reference to CLAVAS via ISOcat. Hennie, I think that still meets your preference and prevents unwanted mixing.
    177 
    178 
    179 
     189metadata DCs to be harvested by CLAVAS.
     190Therefore a threshold seems sensible, where only value domains with more
     191than 20, 50 or 100 values are exported.
     192
     193
     194\subsubsection{Vocabulary linking and use}
     195Currently (before integration of VAS and DCR), the only way to constrain the value domain of a data category
     196is by the means that XML Schema provides \todoin{check xml schema possibilities to restrict values}, such as a regular expression. So for the data category \concept{languageID DC-2482}
     197the rule looks like this:
     198\lstset{language=XML}
     199\begin{lstlisting}
     200        <dcif:conceptualDomain type="constrained">
     201                <dcif:dataType>string</dcif:dataType>
     202                <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
     203                <dcif:rule>[a-z]{3}</dcif:rule>
     204        </dcif:conceptualDomain>
     205\end{lstlisting}
     206
     207A current proposal by Windhouwer\todocite{Menzo2013-03-12 mail} for integration with CLAVAS foresees the following extension:
     208
     209\begin{lstlisting}
     210        <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
     211\end{lstlisting}
     212
     213\code{@href} points to the vocabulary. Strictly speaking, a PID should be used in the context
     214of ISOcat, but it is not clear how persistent the vocabularies are. This may pose a problem, as one part of the DC specification may now have a different persistence than the core.
     215
     216\code{@type} could be \code{closed} or \code{open}. \code{closed}: only values in the vocabulary are
     217valid. \code{open}: the values in the vocabulary are hints/preferred values. Basically the DC itself is then open.
     218
     219This would yield a definition of the conceptualDomain for the data category as follows:
     220 
     221\lstset{language=XML}
     222\begin{lstlisting}
     223  <dcif:conceptualDomain type="constrained">
     224     <dcif:dataType>string</dcif:dataType>
     225     <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
     226     <dcif:rule>[a-z]{3}</dcif:rule>
     227  </dcif:conceptualDomain>
     228  <dcif:conceptualDomain type="constrained">
     229     <dcif:dataType>string</dcif:dataType>
     230     <dcif:ruleType>CLAVAS vocabulary</dcif:ruleType>
     231      <dcif:rule>
     232         <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
     233      </dcif:rule>
     234  </dcif:conceptualDomain>
     235\end{lstlisting}
     236
     237That is, the new rule pointing to the vocabulary would be \emph{added}, so that tools that do not support CLAVAS lookup but are capable of XSD/RNG validation can still use the regular-expression-based definition.
     238
     239\begin{note}
     240Integrate:
     241
     242ISOcat refers to CLAVAS as a hint, the metadata schema is the final one that has the real CLAVAS vocabulary reference, i.e., no reference to CLAVAS via ISOcat.
     243\end{note}
     244
     245Note, though, that anything stated in the DC specification is not binding,
     246but rather a generic hint or recommendation \todoin{check: it is not ``normative''}
     247(even if the DC is closed). The authoritative/normative information is in the schema.
     248A schema modeller who (concept-)links an element in the schema
     249to a DC can decide to apply a different restriction to the values allowed
     250in that element. The information from the DCR serves as a recommendation or default.
     251
     252
     253\begin{figure*}[!ht]
     254\begin{center}
     255\includegraphics[width=0.7\textwidth]{images/concept_linking.png}
     256\end{center}
     257\caption{The data flow and linking between schema, data categories and vocabularies}
     258\label{fig:concept_linking}
     259\end{figure*}
     260 
     261
     262\paragraph {Modelling the vocabulary reference in the schema}
     263It still needs to be defined how the information about the vocabulary can be translated into a valid schema representation.
     264One brute-force approach would be to explicitly enumerate all the values from the vocabulary. This is currently being done
     265within the CMD-framework with the language codes\todocite{cmd-component ISO-639}. However, there is clearly a limit to this approach, both in terms of the size of the vocabulary (ISO-639 contains 7,679 items (language codes), adding some 2MB to each schema referencing it) and its stability/change rate: ISO-639 is a standard with a fixed list, whereas most other vocabularies are more volatile (think of organizations).
     266
     267Most of these vocabularies also cannot be seen as closed-constrained: the list provides a recommended orthographic variant for a given entity while still allowing other values for the given field, rather than restricting the values to only the items from the vocabulary (think of organizations).
     268
     269So this has to be solved in a ``soft'' way. Most schema languages allow the schema to be annotated.
     270This is already used with the DCR, by adding the \code{@dcr:datcat} attribute to schema elements.
     271CMDI (the ComponentRegistry, when generating schemas) also puts information in \code{<xs:appinfo/>}.
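
A minimal sketch of how such annotations could be combined in a generated schema is given below. Only \code{@dcr:datcat}, \code{<xs:appinfo/>} and the proposed \code{<clavas:vocabulary/>} element are taken from the discussion above; the element name, the form of the data category URI and the namespaces bound to the \code{dcr} and \code{clavas} prefixes are assumptions made for illustration.

\lstset{language=XML}
\begin{lstlisting}
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dcr="http://www.isocat.org/ns/dcr"
           xmlns:clavas="http://example.org/ns/clavas">
  <!-- the concept link to the data category stays on the element itself -->
  <xs:element name="languageID" type="xs:string"
              dcr:datcat="http://www.isocat.org/datcat/DC-2482">
    <xs:annotation>
      <xs:appinfo>
        <!-- ignored by plain XSD validators, usable by vocabulary-aware tools -->
        <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639"
                           type="closed"/>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>
</xs:schema>
\end{lstlisting}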
     272
     273Tools like Arbil can access these annotations, e.g., a reference to a CLAVAS vocabulary, and act upon
     274them, i.e., use the autocomplete API of OpenSKOS.
     275Normal XSD validation would then not check whether a value is actually part of the vocabulary. This
     276isn't a problem if the vocabulary is open, e.g., organisation names, but
     277it is when the value domain is closed, e.g., ISO 639-3. In the latter case
     278the XSD generation might have two modes: a lax (smaller) version which
     279doesn't contain the closed vocabulary as an enumeration and leaves it to
     280the tool, and a strict version which does contain the vocabulary as an
     281enumeration. Probably the latter should stay the default, but Arbil could
     282request the lax version leading to smaller and quicker XSD validation
     283inside the tool.
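
The two generation modes could, for instance, differ as sketched below for a language code type. The type names are hypothetical, only two of the enumeration values are shown, and the use of the regular expression in the lax variant is an assumption carried over from the DC rule; an actual ComponentRegistry export may look different.

\lstset{language=XML}
\begin{lstlisting}
<!-- strict: the closed vocabulary is inlined as an enumeration -->
<xs:simpleType name="languageID-strict">
  <xs:restriction base="xs:string">
    <xs:enumeration value="deu"/>
    <xs:enumeration value="eng"/>
    <!-- ... one xs:enumeration per code in the vocabulary ... -->
  </xs:restriction>
</xs:simpleType>

<!-- lax: only the pattern check; the vocabulary lookup is left to the tool -->
<xs:simpleType name="languageID-lax">
  <xs:restriction base="xs:string">
    <xs:pattern value="[a-z]{3}"/>
  </xs:restriction>
</xs:simpleType>
\end{lstlisting}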
     284
     285With this proposal, ISOcat constrained DCs can refer to a CLAVAS vocabulary as a way to constrain their values (we stretch this a bit if a vocabulary is `open', e.g., for organization names, where it provides the preferred spelling of known organizations but it still has to be possible to add new organization names that are not in the vocabulary).
     286
     287 In ISOcat such constraints have the same status as, for example, the data type: ISOcat just provides hints and has no way to enforce them. Consider CMDI, where the CMDI elements refer to an ISOcat DC via a concept link but may have a completely different data type. In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (where this is planned).
     288
     289\begin{note}
     290\noindent
     291something similar for the link to an EBNF grammar in SCHEMAcat:
     292
     293%\begin{lstlisting}
     294\begin{verbatim}
     295      <scr:valueSchema
     296               xmlns:scr="http://www.isocat.org/ns/scr"
     297               pid="http://hdl.handle.net/1839/00-SCHM-0000-0000-004A-A"
     298               type="ISO 14977:1996 EBNF"/>
     299\end{verbatim}
     300%\end{lstlisting}
     301\end{note}
     302
     303
     304Finally, the client application (e.g. a metadata editor) is configured/guided by the schema.
     305It can use the reference to the DC to fetch explanations (semantic information and translations) from ISOcat, but it is bound to the value range as restricted by the schema.
     306
     307\todoask{Could the application use the vocabulary indication in the DC-spec as default or fallback?}
     308
     309
     310
     311       
    180312\subsection{CMDI - Exploitation side}
    181 Metadata complying with the CMD-framework is being created by a growing number of institutions by various means: automatic transformation from legacy data, or authoring of new metadata records with the help of one of the metadata editors (TODO: cite: Arbil, NALIDA, ). The CMD-Infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and to announce the OAI-PMH endpoints. These are harvested daily by a dedicated CLARIN harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}. The harvested data is validated against the schemas \todo{What about Normalization?} and made available in packaged datasets. These are fetched by the exploitation-side components, which index the metadata records and make them available for searching and browsing.
     313Metadata complying with the CMD-framework is being created by a growing number of institutions by various means: automatic transformation from legacy data, or authoring of new metadata records with the help of one of the metadata editors (TODO: cite: Arbil, NALIDA, ). The CMD-Infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and to announce the OAI-PMH endpoints. These are harvested daily by a dedicated CLARIN harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}. The harvested data is validated against the schemas \todoin{What about Normalization?} and made available in packaged datasets. These are fetched by the exploitation-side components, which index the metadata records and make them available for searching and browsing.
    182314
    183315\begin{figure*}[!ht]
     
    189321The first stable and publicly available application providing access to the collected metadata of CMDI has been the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, being developed within the CLARIN project. This application operates on the same collection of data as is discussed in this work; however, it employs a faceted search, manually mapping the appropriate metadata fields from the different schemas to 10? fixed facets. The underlying search engine is the widely used full-text search engine Apache Solr\footnote{\url{http://lucene.apache.org/solr/}}. Although this is a very reductionist approach, it is certainly a great starting point, offering a core set of categories together with an initial set of category mappings.
    190322
    191 More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\footnote{\url{http://www.meertens.knaw.nl/cmdi/search/}}. It too is based on Apache Solr and provides a faceted search, but with both a substantially more sophisticated indexing process and search interface. \todo { describe indexing and search}
    192 \todo { add citation}
     323More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\footnote{\url{http://www.meertens.knaw.nl/cmdi/search/}}. It too is based on Apache Solr and provides a faceted search, but with both a substantially more sophisticated indexing process and search interface. \todoin { describe indexing and search}
     324\todocite {MI Search Engine}
    193325
    194326And finally, there is the \emph{Metadata Repository} aimed to collect all the harvested metadata descriptions from CLARIN centers,
     
    200332
    201333The requirements for these repositories: PIDs, CMD, OAI-PMH
    202 \todo{cite: center-B paper}
     334\todocite{center-B paper}
    203335
    204336\section{Distributed system - federated search}
  • SMC4LRT/chapters/Introduction.tex

    r2697 r2703  
    1414
    1515
     16\todocode{install older python (2.5?) to be able to install dot2tex - transforming dot files to nicer pgf formatted graphs}\furl{http://dot2tex.googlecode.com/files/dot2tex-2.8.7.zip}\furl{file:/C:/Users/m/2kb/tex/dot2tex-2.8.7/}
     17
     18
    1619\subsection{Problem statement}
    1720
    1821While in the Digital Libraries community a consolidation has generally already happened and big federated networks of digital library repositories have been set up, in the field of Language Resources and Technology the landscape is still scattered, despite a decade of standardization efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming, for one thing, from the wide range of resource types, further complicated by the dependence on different schools of thought.
    1922
    20 \todo{Need some numbers about the disparity in the field: number of institutes, resources, formats.}
     23\todoin{Need some numbers about the disparity in the field: number of institutes, resources, formats.}
    2124
    2225This situation has been identified by the community and multiple standardization initiatives have been undertaken. This process seems to have gained new momentum thanks to large Research Infrastructure Programmes introduced by the European Commission, aimed at fostering research communities developing large-scale pan-European common infrastructures. One key player in this development is the project CLARIN.
  • SMC4LRT/chapters/Literature.tex

    r2697 r2703  
    44%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    55
    6 This work is guided by \todo{two (or three? + Infrastructure)} main dimensions: the data (broadly, Language Resources and Technology) and the method (Semantic Web technologies). This division is reflected in the following chapter:
     6This work is guided by \todoin{two (or three? + Infrastructure)} main dimensions: the data (broadly, Language Resources and Technology) and the method (Semantic Web technologies). This division is reflected in the following chapter:
    77
    88\section{(Infrastructure for) Language Resources and Technology}
     
    2727\item[PHAIDRA] Permanent Hosting, Archiving and Indexing of Digital Resources and Assets, provided by Vienna University \footnote{\url{https://phaidra.univie.ac.at/}}
    2828\item[eSciDoc]  provided by MPG + FIZ Karlsruhe \footnote{\url{https://www.escidoc.org/}}
     29\item[TextGrid] \todocode{install: TextGrid2 - check: TG-search}\furl{http://textgrid.de}
    2930\item[DRIVER] pan-European infrastructure of Digital Repositories \footnote{\url{http://www.driver-repository.eu/}}
    3031\item[OpenAIRE] - Open Access Infrastructure for Research in Europe \footnote{\url{http://www.openaire.eu/}}
     
    4243
    4344\subsection{FederatedSearch}
    44 
     45\todoask{How to relate Federated Search to SMC? }
    4546
    4647
    4748\section{Semantic Web}
    4849
    49 \todo{cite TimBL}
     50\todoin{cite TimBL}
    5051
    5152\begin{description}
     
    6667
    6768One more specific, recent source of inspiration is the work of Noah et al. \cite{Noah2010}, who developed a semantic digital library for an academic institution. The scope is limited to document collections, but many aspects nevertheless seem very relevant for this work, such as operating on document metadata, ontology population, and sophisticated querying and searching.
     69
     70\todoin{check if relevant: http://schema.org/}
    6871
    6972\subsection{Ontology Visualization}
  • SMC4LRT/chapters/SMC.tex

    r2696 r2703  
    33
    44
    5 \section{Data Model}
     5\section{Data Model?}
    66
    77Terms ?
     
    1010RDF
    1111
    12 
    1312\subsection{CMD namespace}
    1413Describe the CMD-format?
    15 
    16 
    17 \subsection{DCR in SKOS}
    18 \label{dcr-skos}
    19 Describe the mapping from DCR into SKOS
    20 
    21 DCR recognizes following types of data categories:
    22 simple, complex: closed, open, constrained, (container)?
    23 
    24 \begin{figure*}[!ht]
    25 \begin{center}
    26 \includegraphics[width=0.7\textwidth]{images/dc_types}
    27 \end{center}
    28 \caption{Data Category types}
    29 \end{figure*}
    30 \todo{cite:  ISOcat introduction at CLARIN-NL Workshop}
    31 
    32 The export to CLAVAS-SKOS only considers closed and simple DCs from the metadata profile.
    33 A closed DC maps to a concept scheme and a simple DC to a SKOS concept in such a concept scheme.
    34 However it needs to be yet assessed how useful this approach is. In the metadata profile
    35 there are many closed DCs with small value domains. How useful are those
    36 in CLAVAS?
    37 Originally, the vocabulary repository has been conceived to manage rather large and complex value domains,
    38 that do not fit easily in the DCR data-model.
    39 Therefore a threshold seems sensible, where only value domains with more
    40 than 20, 50 or 100 values are exported.
    41 
    42 Open or constrained DCs are not exported as they don't provide anything to a vocabulary. \todo{cite: Menzo2013-03-12 mail}
    43 However, they can become users of a CLAVAS vocabulary. Actually, providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository.
    44 
    45 Currently (before integration of VAS and DCR), the only possibility to constrain the value domain of a data category
    46 is by the means that XML Schema provides, such as a regular expression. So for the data category \concept{languageID DC-2482}
    47 the rule looks like:
    48 \lstset{language=XML}
    49 \begin{lstlisting}
    50         <dcif:conceptualDomain type="constrained">
    51                 <dcif:dataType>string</dcif:dataType>
    52                 <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
    53                 <dcif:rule>[a-z]{3}</dcif:rule>
    54         </dcif:conceptualDomain>
    55 \end{lstlisting}
    56 
    57 A current proposal by Windhouwer\todo{cite: Menzo2013-03-12 mail} for integration with CLAVAS foresees the following extension:
    58 
    59 \begin{lstlisting}
    60         <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
    61 \end{lstlisting}
    62 
    63 \code{@href} points to the vocabulary. Actually a PID should be used in the context
    64 of ISOcat, but it is not clear how persistent the vocabularies are. This may pose a problem, as part of the DC specification may then have a different persistence than the core.
    65 
    66 \code{@type} could be \code{closed} or \code{open}. \code{closed}: only values in the vocabulary are
    67 valid. \code{open}: the values in the vocabulary are hints/preferred values. Basically the DC itself is then open.
    68 
    69 This would yield a definition of the conceptualDomain for the data category as follows:
    70  
    71 \lstset{language=XML}
    72 \begin{lstlisting}
    73   <dcif:conceptualDomain type="constrained">
    74      <dcif:dataType>string</dcif:dataType>
    75      <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
    76      <dcif:rule>[a-z]{3}</dcif:rule>
    77   </dcif:conceptualDomain>
    78   <dcif:conceptualDomain type="constrained">
    79      <dcif:dataType>string</dcif:dataType>
    80      <dcif:ruleType>CLAVAS vocabulary</dcif:ruleType>
    81       <dcif:rule>
    82          <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
    83       </dcif:rule>
    84   </dcif:conceptualDomain>
    85 \end{lstlisting}
    86 
    87 I.e. the new rule pointing to the vocabulary would be \emph{added}, so that tools that don't support CLAVAS
    88 lookup but are capable of XSD/RNG validation can still use the regular expression based definition.
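
How a tool might combine the two rules can be sketched as follows. This is a minimal sketch under assumptions: the function name and the three-code sample set are hypothetical, and a real tool would fetch the vocabulary from the \code{@href} target instead of using a hard-coded set.

```python
import re


def is_valid(value, pattern, vocabulary, vocab_type="closed"):
    """Check a value against the regex rule and, if closed, the vocabulary."""
    # Fallback rule that plain XSD/RNG validators can already enforce.
    if re.fullmatch(pattern, value) is None:
        return False
    # A closed vocabulary admits only its own values;
    # an open one merely lists hints or preferred values.
    if vocab_type == "closed":
        return value in vocabulary
    return True


# Hypothetical excerpt of the language code vocabulary, for illustration only.
iso639_sample = {"deu", "eng", "nld"}
rule = r"[a-z]{3}"
```

A value such as \code{xyz} passes the regular expression but is rejected by a closed vocabulary, which is exactly the gain the additional rule is meant to provide; with \code{type="open"} the same value would be accepted.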
    89 
    90  
    91 \begin{note}
    92 
    93 \noindent
    94 something similar for the link to an EBNF grammar in SCHEMAcat:
    95 
    96 %\begin{lstlisting}
    97 \begin{verbatim}
    98       <scr:valueSchema
    99                xmlns:scr="http://www.isocat.org/ns/scr"
    100                pid="http://hdl.handle.net/1839/00-SCHM-0000-0000-004A-A"
    101                type="ISO 14977:1996 EBNF"/>
    102 \end{verbatim}
    103 %\end{lstlisting}
    104 \end{note}
    10514
    10615
     
    234143\begin{enumerate}
    235144\item  express MDRecords in RDF
    236 \item  identify related ontologies/vocabularies (category -> vocabulary)
     145\item  identify related ontologies/vocabularies (category $\rightarrow$ vocabulary)
    237146\item  use a lookup/mapping function (Vocabulary Alignment Service? CATCH-PLUS?)
    238147
    239148%\fbox{ function lookup: Category x String -> ConceptualDomain}
    240149\begin{eqnarray*}
    241 lookup(Category, Literal) -> ConceptualDomain??
     150lookup(Category, Literal) \rightarrow ConceptualDomain??
    242151\end{eqnarray*}
    243152
     
    249158\subsection{Linked Data - Express dataset in RDF}
    250159
     160
     161I do think that ISOcat, CLAVAS, RELcat, an actual language
     162resource all provide a part of the semantic network.
     163
     164And if you can express these all in RDF, which we can for almost all of them (maybe
     165except the actual language resource ... unless it has a schema adorned
     166with ISOcat DC references ... \textless insert a SCHEMAcat plug ;-) \textgreater, but for
     167metadata we have that in the CMDI profiles ...) you could load all the
     168relevant parts in a triple store and do your SPARQL/reasoning on it. Well
     169that's where I'm ultimately heading with all these registries related to
     170semantic interoperability ... I hope ;-)
     171\todocite{Menzo}
     172
     173
    251174Partly as a by-product of the entities-mapping effort we will get the metadata-description rendered in RDF, linked with
    252175So theoretically we then only need to provide them ``on the web'', to make them a nucleus of the LinkedData-Cloud.
     
    254177
    255178Technical aspects (RDF-store?) / interface (ontology browser?)
     179
     180\todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/}
    256181
    257182defining the Mapping:
    258183\begin{enumerate}
    259184\item convert to RDF
    260 translate: MDREcord -> [\#mdrecord \#property literal]
    261 \item map: \#mdrecord \#property literal  -> [\#mdrecord \#property \#entity]
     185translate: MDRecord $\rightarrow$ [\#mdrecord \#property literal]
     186\item map: \#mdrecord \#property literal  $\rightarrow$ [\#mdrecord \#property \#entity]
    262187\end{enumerate}
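
The two mapping steps can be sketched with plain tuples standing in for RDF triples. Everything named here (the functions, the lookup table, the identifiers such as \code{\#lang-deu}) is a hypothetical illustration; the actual lookup would be backed by a vocabulary service rather than a dictionary.

```python
def to_triples(record_id, fields):
    """Step 1: turn (property, literal) pairs into triples with literal objects."""
    return [(record_id, prop, literal) for prop, literal in fields]


def map_entities(triples, lookup):
    """Step 2: replace a literal by an entity wherever the lookup finds one."""
    mapped = []
    for subject, prop, literal in triples:
        entity = lookup(prop, literal)
        # Keep the literal unchanged when no matching entity is known.
        mapped.append((subject, prop, entity if entity is not None else literal))
    return mapped


# Hypothetical lookup table standing in for a vocabulary service.
ENTITIES = {("language", "German"): "#lang-deu"}


def lookup(prop, literal):
    """Return the entity for (category, literal), or None if there is none."""
    return ENTITIES.get((prop, literal))


triples = to_triples("#mdrecord1", [("language", "German"), ("title", "Corpus A")])
linked = map_entities(triples, lookup)
```

Keeping unmapped literals as they are means the second step never loses information: a record is at worst as expressive after mapping as it was before.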
    263188
     
    266191\includegraphics[width=1\textwidth]{images/SMC_CMD2LOD}
    267192\caption{The process of transforming the CMD metadata records to an RDF representation}
     193\label{fig:smc_cmd2lod}
    268194\end{figure*}
    269195
  • SMC4LRT/chapters/System.tex

    r2697 r2703  
    5656
    5757
     58\section{SMC LOD}
    5859
    59 \section{User Interface}
     60\todoin{read: Europeana RDF Store Report}
     61
     62\todocode{install Jena +  fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site}
     63
     64\todocode{Load data: relcat, clavas, olac-and-dc-providers cmd, lt-world?}
     65
     66
     67\section{User Interface?}
    6068
    6169\subsection{Query Input}
  • SMC4LRT/thesis.tex

    r2695 r2703  
    4242% define custom macros for specific formats or names
    4343
    44 \newcommand{\todo}[1]{\textcolor{red}{#1}}
    45 \newcommand{\concept}[1]{\texttt{#1}}
    46 \newcommand{\furl}[1]{\footnote{\url{#1}}}
    47 \newcommand{\ftodo}[1]{\footnote{\todo{#1}}}
    48 \newcommand{\xne}[1]{\textsf{#1}}
    49 \newcommand{\cd}{\textsf{Class Diagram}}
    50 
     44\input{utils}
    5145
    5246\setcounter{tocdepth}{2}
     
    9589\appendix
    9690
     91\input{chapters/appendix}
     92
    9793\bibliographystyle{plain}
    9894%\bibliography{references}
  • SMC4LRT/utils.tex

    r2695 r2703  
    66\usetikzlibrary{arrows,automata}
    77
    8 \usepackage[textsize=footnotesize, textwidth=1in, colorinlistoftodos=1,
    9                 bordercolor=todoborder, linecolor=todoborder, backgroundcolor=todobg]
    10                 {todonotes}
     8 % disable
     9\usepackage[textsize=footnotesize, textwidth=1in, colorinlistoftodos=1,                 bordercolor=todoborder, linecolor=todoborder, backgroundcolor=todobg]{todonotes}
    1110
    1211\newcommand{\todoin}[1]{\todo[inline]{#1}}
     12\newcommand{\todocite}[1]{\todo[inline,backgroundcolor=cite]{#1}}
     13\newcommand{\todoask}[1]{\todo[inline,backgroundcolor=ask]{#1}}
     14\newcommand{\todocode}[1]{\todo[inline,backgroundcolor=code]{#1}} % anything that runs: installing, implementing, data transform
    1315\newcommand{\concept}[1]{\textsf{#1}}
    1416\newcommand{\code}[1]{\texttt{#1}}
    1517\newcommand{\xne}[1]{\textsf{#1}}
    1618\newcommand{\furl}[1]{\footnote{\url{#1}}}
    17 \newcommand{\ftodo}[1]{\footnote{\todo{#1}}}
     19\newcommand{\ftodo}[1]{\footnote{\todoin{#1}}}
    1820
    1921\newenvironment{note}
     
    3133
    3234\definecolor{todobg}{rgb}{0.8,0.8,1}
     35\definecolor{cite}{rgb}{0.8,1,0.8}
     36\definecolor{ask}{rgb}{1,1,0.8}
     37\definecolor{code}{rgb}{1,0.8,0.8}
    3338\definecolor{todoborder}{rgb}{0.8,0.4,0.4}
    3439
    3540 
    3641\lstset{
    37   basicstyle=\ttfamily,
     42  basicstyle=\ttfamily\footnotesize,
    3843  columns=fullflexible,
    3944  showstringspaces=false,
     
    4348\lstdefinelanguage{XML}
    4449{
    45   basicstyle=\ttfamily\color{darkblue}\bfseries,
     50  basicstyle=\ttfamily\color{darkblue}\bfseries\footnotesize,
    4651  morestring=[b]",
    4752  morestring=[s]{>}{<},
     
    5257  morekeywords={xmlns,version,type}% list your attributes here
    5358}
    54 