- Timestamp: 03/15/13 21:44:23
- Location: SMC4LRT
- Files: 13 added, 10 edited
SMC4LRT/Outline.tex (r2695 → r2703)

  %\svnid{$Id$}
+ \usepackage{titlesec}
+ \titlespacing*{\chapter}{0pt}{0.5in}{0.5in}
  %%% Examples of Article customizations
…
  \geometry{a4paper} % or letterpaper (US) or a5paper or ...
  %\geometry{margin=1cm} % for example, change the margins to 2 inches all round
- \topmargin=-0.5in
+ \topmargin=-0.6in
  \textheight=700pt
  % \geometry{landscape} % set up the page for landscape
…
  %%% HEADERS & FOOTERS
  \usepackage{fancyhdr} % This should be set AFTER setting up the page geometry
- \pagestyle{plain} % options: empty, plain, fancy
+ \pagestyle{empty} % options: empty, plain, fancy
  \renewcommand{\headrulewidth}{0pt} % customise the layout...
  \lhead{}\chead{}\rhead{}
…
- \input{utils.tex}
+ \input{utils}
  %%% END Article customizations
…
  \begin{document}
  \maketitle
- \tableofcontents*
+ \newgeometry{top=0.8in,bottom=1in}
+ %\addtocontents{toc}{\protect\enlargethispage{35mm}}
+ \tableofcontents
+ \restoregeometry
  \listoffigures
…
- \section{Questions, Remarks}
- \begin{itemize}
- \item How does this relate to federated search?
- \item ontological vs. semasiological (semantic features: categorial/archisemes, differentiating, specifying)
- \end{itemize}
  \bibliographystyle{ieeetr}
  \bibliography{../../2bib/lingua,../../2bib/ontolingua,../../2bib/smc4lrt,../../2bib/semweb}
+ \appendix
+ \input{chapters/appendix}
  \end{document}
SMC4LRT/chapters/Data.tex (r2697 → r2703)

  \section{Metadata Formats}
  \subsection{CMD-Framework}
…
  \end{center}
+ \todoin{Collect numbers about the CMD-Framework (profiles, datcats) + historical development}
+ \todoin{Collect numbers about CMD records (collections, used profiles, ...) in historical perspective}
  \subsection{Dublin Core + OLAC}
  DC, OLAC
+ DublinCore Resource Types\furl{http://dublincore.org/documents/resource-typelist/}
  \subsection{TEI / teiHeader}
…
  \section{LRT Metadata Catalogs/Collections}
- \todo{[DFKI/LT-World] - collection or ontology}
+ \todoin{Overview of catalogs: name, since, \#providers, \#resources}
+ \todoin{[DFKI/LT-World] - collection or ontology}
  \subsection{CMDI}
SMC4LRT/chapters/Infrastructure.tex (r2696 → r2703)

  \begin{itemize}
- \item Data Category Registry,
+ \item Data Category Registry
  \item Relation Registry
+ \item Schema Registry
  \item Component Registry
  \item Vocabulary Alignment Service (OpenSKOS)
…
  !Cf. Erhard Hinrichs 2009
+ \todoin{Describe SCHEMAcat}
  \noindent
  All these components are running services that this work shall directly build upon.
…
  \subsubsection{Vocabulary Service - CLAVAS}
  As described in the previous section (\ref{dcr}), a solid pillar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is, by design, not usable for all kinds of reference data. In general, it suits well for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow does not lend itself well to maintaining ``semi-closed'' concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors).
  In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added).

  This shortcoming leads to a need for an additional registry/repository service for this kind of data (controlled vocabularies). Within the CLARIN project, mainly the abovementioned taskforce \emph{CLAVAS} is concerned with this challenge.
…
  Following are those to be handled in the short term, in order of urgency/relevance/priority:
  \begin{itemize}
- \item the list of language codes\todo{url: ISO-639}
+ \item the list of language codes\todoin{url: ISO-639}
  \item country codes
  \item organization names for the domain of language resources
…
  See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies
- and \ref{dcr-skos} for discussion on mapping the information about data categories from ISOcat to \xne{SKOS}.
+ and \ref{interaction-dcr-skos} for discussion on mapping the information about data categories from ISOcat to \xne{SKOS}.

  \subsection{Interaction between DCR, VAS and client applications}
- In my view you do that in ISOcat by binding the constrained DC to the CLAVAS vocabulary, e.g., the constrained domain of /language ID/ (DC-2482) could look as follows:
- I think there is no need to express the relationship between this constrained DC and the vocabulary in CLAVAS itself. Many DCs (or any other application using CLAVAS) can refer to the same CLAVAS vocabulary.
- See above for my reasoning. I don't think this information needs to be in CLAVAS.
- I do think that ISOcat, CLAVAS, RELcat, an actual language resource all provide a part of the semantic network.
- And if you can express these all in RDF, which we can for almost all of them (maybe except the actual language resource ... unless it has a schema adorned with ISOcat DC references ... <insert a SCHEMAcat plug ;-)>, but for metadata we have that in the CMDI profiles ...) you could load all the relevant parts in a triple store and do your SPARQL/reasoning on it. Well, that's where I'm ultimately heading with all these registries related to semantic interoperability ... I hope ;-)
- Maybe I should add to this that I clearly see ISOcat as a user of CLAVAS, i.e., for constrained DCs.
- However, ISOcat as a provider of vocabularies is less clear to me. Many of the value domains are small and CLAVAS is overkill.
+ \label{interaction-dcr-skos}
+ DCR recognizes the following types of data categories (Figure \ref{fig:dc_type}):
+ simple; complex: closed, open, constrained, (container)?
+ \begin{figure*}[!ht]
+ \begin{center}
+ \includegraphics[width=0.7\textwidth]{images/dc_types}
+ \end{center}
+ \caption{Data Category types}
+ \label{fig:dc_type}
+ \end{figure*}
+ \todocite{DC types - ISOcat introduction at CLARIN-NL Workshop}
+ See \ref{fig:DCR_data_model} for the full DCR data model.
+ \subsubsection{Export DCR to SKOS}
+ \todocite{Menzo2013-03-12 mail}
+ The semantic proximity of a /data category/ to a /concept/ may mislead to a na\"ive approach to mapping DCR to SKOS, namely mapping every data category (from one profile) to a concept, all of them belonging to the \xne{ISOcat-profile:ConceptScheme}. However, this is not practical/useful; ISOcat as a whole is too disparate, and so would be the resulting vocabulary.
+ A more sensible approach is to export only closed DCs as separate ConceptSchemes and their respective simple DCs as Concepts within that scheme.
+ The rationale is that, if we see a vocabulary as a set of possible values for a field/element/attribute, complex DCs in ISOcat are the users of such vocabularies and simple DCs the DCR equivalent of values in such a vocabulary.\todocite{Menzo}
+ Another aspect is that a simple DC can be in the value domains of multiple closed DCs. Also, a skos:Concept can belong to multiple ConceptSchemes\furl{http://www.w3.org/TR/skos-primer/\#secscheme}. So there could be a 1:1 mapping of [complex closed DCs] to [skos:ConceptSchemes] and of [simple DCs] to [skos:Concepts]. That would automatically convey also the possibly multiple membership of simple DCs / skos:Concepts in closed DCs / skos:ConceptSchemes.
+ Alternatively, for each value domain a SKOS concept scheme with SKOS concepts can be created, i.e., a SKOS concept always belongs to one concept scheme, but multiple SKOS concepts refer to the same simple DC using <dcr:datcat/> (and <dcterms:source/>). This is how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest/representations/dcs2/clavas.xsl}
+ \begin{figure*}[!ht]
+ \begin{center}
+ \includegraphics[width=0.6\textwidth]{images/export_DCR2SKOS.png}
+ \end{center}
+ \caption{The data flow and linking between schema, data categories and vocabularies}
+ \label{fig:export_dcr2skos}
+ \end{figure*}
+ Open or constrained DCs are not exported, as they don't provide anything to a vocabulary. There is no need to express the relationship between such a constrained DC and the vocabulary in CLAVAS itself. Indeed, it is not possible to express the conceptualDomain/range of a data category within SKOS.
+ However, they can refer to a CLAVAS vocabulary. Indeed, providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository.
+ However, it yet needs to be assessed how useful this approach is. In the metadata profile there are many closed DCs with small value domains. How useful are those in CLAVAS?
+ Originally, the vocabulary repository has been conceived to manage rather large and complex value domains that do not fit easily in the DCR data model.
  Where the value domains are big (ISO 639-3) or can only be partially enumerated (organization names) ISOcat can't/shouldn't contain
…
  providers, e.g., /linguistic subject/ (DC-2527), and still also need to stay in ISOcat. I think at some point we should create a smaller set of metadata DCs to be harvested by CLAVAS.
- Hennie and I discussed this also somewhere last year ... I'll be at the Meertens on Thursday, maybe we can talk it over once more.
- I guess the discussion is about two different things:
- - how to specify that the range of some metadata property consists of Concepts from a specific ConceptScheme -> this cannot be done in SKOS, but external schema definitions could refer to the URI of some (CLAVAS/OpenSKOS) ConceptScheme
- - how to specify relations between Concepts that are in different ConceptSchemes -> this can be done in SKOS using skos:exactMatch, closeMatch, broaderMatch, narrowerMatch, relatedMatch. OpenSKOS supports adding and searching these properties already, and the OpenSKOS editor also already has support for it.
- > - define them in a new clavas namespace and add the properties as a specialization to OpenSKOS; you consider them part of the vocabulary definition then
- > --> is a bit against the OpenSKOS 'philosophy' that OpenSKOS is a platform for SKOS, by definition.
- > - add them to your metadata schema or profile; you consider them as constraints on vocabulary usage for a given metadata field
- > --> this would be my preference
- > - add them to a definition in ISOcat, and let your metadata schema refer to ISOcat instead of OpenSKOS. ISOcat extends the OpenSKOS definition then.
- > --> leads to mixing of ISOcat and OpenSKOS, in semantic and technical ways. Not my preference.
- In what I propose, ISOcat constrained DCs can refer to a CLAVAS vocabulary as a way to constrain (we stretch this a bit if a vocabulary is 'open', e.g., like organization names, where it provides the preferred spelling of known organizations but you still have to be able to add new organization names). In ISOcat such constraints have the same status as, for example, the data type, which is that ISOcat just provides hints; it has no way to enforce this. Look at CMDI, where the CMDI elements refer to an ISOcat DC via a concept link but may have a completely different data type. In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (well, it's in the planning).
- So although ISOcat refers to CLAVAS as a hint, the metadata schema is the final one that has the real CLAVAS vocabulary reference, i.e., no reference to CLAVAS via ISOcat. Hennie, I think that still meets your preference and prevents unwanted mixing.
+ Therefore a threshold seems sensible, where only value domains with more than 20, 50 or 100 values are exported.
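The export rule added above (closed DCs become concept schemes, their simple DCs become concepts, small value domains are skipped by a threshold) can be illustrated with a short stdlib-only Python sketch. The data category IDs, the dict layout and the base URI are made up for illustration; the real export is an XSLT (clavas.xsl) on the ISOcat REST interface.

```python
# Sketch of the closed-DC -> SKOS export described above. Data category IDs,
# the dict layout and the base URI are hypothetical.

THRESHOLD = 20  # only export value domains with more than N values

def export_closed_dcs(datcats, base="http://www.isocat.org/datcat/"):
    """Map each closed complex DC to a skos:ConceptScheme and the simple DCs
    in its value domain to skos:Concepts within that scheme."""
    triples = []
    for dc in datcats:
        if dc["type"] != "closed" or len(dc["domain"]) <= THRESHOLD:
            continue  # open/constrained DCs and small value domains are skipped
        scheme = base + dc["id"]
        triples.append((scheme, "rdf:type", "skos:ConceptScheme"))
        for simple in dc["domain"]:
            # a simple DC may sit in several value domains; skos:inScheme
            # is n:m, so the same concept may end up in several schemes
            triples.append((base + simple, "rdf:type", "skos:Concept"))
            triples.append((base + simple, "skos:inScheme", scheme))
    return triples

closed_dc = {"id": "DC-1", "type": "closed",
             "domain": ["DC-%d" % (100 + i) for i in range(25)]}
open_dc = {"id": "DC-2", "type": "open", "domain": []}
triples = export_closed_dcs([closed_dc, open_dc])  # only DC-1 is exported
```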
+ \subsubsection{Vocabulary linking and use}
+ Currently (before the integration of VAS and DCR), the only possibility to constrain the value domain of a data category is by the means an XML Schema provides\todoin{check XML Schema possibilities to restrict values}, like a regular expression. So for the data category \concept{languageID DC-2482} the rule looks like:
+ \lstset{language=XML}
+ \begin{lstlisting}
+ <dcif:conceptualDomain type="constrained">
+   <dcif:dataType>string</dcif:dataType>
+   <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
+   <dcif:rule>[a-z]{3}</dcif:rule>
+ </dcif:conceptualDomain>
+ \end{lstlisting}
+ A current proposal by Windhouwer\todocite{Menzo2013-03-12 mail} for the integration with CLAVAS foresees the following extension:
+ \begin{lstlisting}
+ <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
+ \end{lstlisting}
+ \code{@href} points to the vocabulary. Actually, a PID should be used in the context of ISOcat, but it is not clear how persistent the vocabularies are. This may pose a problem, as part of the DC specification may now have a different persistency than the core.
+ \code{@type} could be \code{closed} or \code{open}. \code{closed}: only values in the vocabulary are valid. \code{open}: the values in the vocabulary are hints/preferred values; basically, the DC itself is then open.
+ This would yield a definition of the conceptualDomain for the data category as follows:
+ \lstset{language=XML}
+ \begin{lstlisting}
+ <dcif:conceptualDomain type="constrained">
+   <dcif:dataType>string</dcif:dataType>
+   <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
+   <dcif:rule>[a-z]{3}</dcif:rule>
+ </dcif:conceptualDomain>
+ <dcif:conceptualDomain type="constrained">
+   <dcif:dataType>string</dcif:dataType>
+   <dcif:ruleType>CLAVAS vocabulary</dcif:ruleType>
+   <dcif:rule>
+     <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
+   </dcif:rule>
+ </dcif:conceptualDomain>
+ \end{lstlisting}
+ I.e. the new rule pointing to the vocabulary would be \emph{added}, so that tools that don't support CLAVAS lookup but are capable of XSD/RNG validation can still use the regular-expression-based definition.
+ \begin{note}
+ Integrate: ISOcat refers to CLAVAS as a hint; the metadata schema is the final one that has the real CLAVAS vocabulary reference, i.e., no reference to CLAVAS via ISOcat.
+ \end{note}
+ Note though, that anything stated in the DC specification is not binding, but rather a generic hint or recommendation\todoin{check: it is not ``normative''} (even if the DC is closed). The authoritative/normative information is in the schema. A schema modeler, (concept-)linking an element in the schema to a DC, can decide to have another restriction for the values allowed in that element. The information from the DCR serves as a recommendation or default.
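How a CLAVAS-aware tool might interpret the proposed @type semantics (a closed vocabulary as a hard constraint, an open one as a mere hint next to the regular-expression rule) can be sketched in Python. This is a hypothetical client-side helper, not part of any of the discussed tools; as stated above, the schema, not the DC specification, stays authoritative.

```python
import re

# Hypothetical client-side check for the proposed @type semantics.
def check_value(value, vocab, vocab_type, pattern=r"[a-z]{3}"):
    """closed: only values from the vocabulary are valid.
    open: the vocabulary only lists preferred values; any value that
    satisfies the regular-expression rule of the DC is accepted."""
    if vocab_type == "closed":
        return value in vocab
    # open vocabulary: fall back to the XML Schema regular expression
    return re.fullmatch(pattern, value) is not None

iso639 = {"deu", "eng", "nld"}
check_value("eng", iso639, "closed")  # valid: in the vocabulary
check_value("xyz", iso639, "closed")  # invalid: not in the closed vocabulary
check_value("xyz", iso639, "open")    # valid: matches [a-z]{3} despite not being listed
```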
+ \begin{figure*}[!ht]
+ \begin{center}
+ \includegraphics[width=0.7\textwidth]{images/concept_linking.png}
+ \end{center}
+ \caption{The data flow and linking between schema, data categories and vocabularies}
+ \label{fig:concept_linking}
+ \end{figure*}
+ \paragraph{Modelling the vocabulary reference in the schema}
+ It yet needs to be defined how the information about the vocabulary can be translated into a valid schema representation. One brute-force approach would be to explicitly enumerate all the values from the vocabulary. This is currently being done within the CMD-framework with the language codes\todocite{cmd-component ISO-639}. However, there is clearly a limit to this approach, both in terms of the size of the vocabulary (ISO-639 contains 7,679 items (language codes), adding some 2MB to each schema referencing it) and its stability/change rate: ISO-639 is a standard with a fixed list, but most other vocabularies are more volatile (think organizations).
+ Most of these vocabularies also cannot be seen as closed-constrained, i.e. the provided list offers a recommended orthographic variant for a given entity, still allowing other values for the given field, rather than restricting the values to only the items from the vocabulary (think organizations).
+ So this has to be solved in a ``soft'' way. Most schema languages allow annotating the schema. This is already used with DCR, adding \code{@dcr:datcat} to schema elements. Also CMDI (the ComponentRegistry, when generating schemas) puts information in <xs:appinfo/>.
+ Tools like Arbil can get access to these annotations, e.g., a reference to a CLAVAS vocabulary, and act upon it, i.e., use OpenSKOS's autocomplete API. Normal XSD validation then wouldn't check whether a value actually is part of the vocabulary. This isn't a problem if the vocabulary is open, e.g., organisation names, but it is when the value domain is closed, e.g., ISO 639-3. In the latter case the XSD generation might have two modes: a lax (smaller) version, which doesn't contain the closed vocabulary as an enumeration and leaves it to the tool, and a strict version, which does contain the vocabulary as an enumeration. Probably the latter should stay the default, but Arbil could request the lax version, leading to smaller and quicker XSD validation inside the tool.
+ With this proposal, ISOcat constrained DCs can refer to a CLAVAS vocabulary as a way to constrain (we stretch this a bit if a vocabulary is 'open', e.g., like organization names, where it provides the preferred spelling of known organizations but it still has to be possible to add new organization names not in the vocabulary).
+ In ISOcat such constraints have the same status as, for example, the data type: ISOcat just provides hints; it has no way to enforce them. Look at CMDI, where the CMDI elements refer to an ISOcat DC via a concept link but may have a completely different data type. In an ideal world, the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (well, it's in the planning).
+ \begin{note}
+ \noindent
+ something similar for the link to an EBNF grammar in SCHEMAcat:
+ \begin{verbatim}
+ <scr:valueSchema
+     xmlns:scr="http://www.isocat.org/ns/scr"
+     pid="http://hdl.handle.net/1839/00-SCHM-0000-0000-004A-A"
+     type="ISO 14977:1996 EBNF"/>
+ \end{verbatim}
+ \end{note}
+ Finally, the client application (e.g. a metadata editor) is configured/guided by the schema.
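The two XSD generation modes discussed above (strict, with the closed vocabulary materialized as an enumeration, and lax, with only the pattern facet) could look as follows. The function name and the output shape are illustrative, not the actual ComponentRegistry XSD generator; XSD permits combining pattern and enumeration facets in one restriction.

```python
# Sketch of strict vs. lax XSD generation for a closed vocabulary
# (illustrative names, not the real ComponentRegistry output).
def simple_type(name, pattern, vocab=None, strict=True):
    lines = [f'<xs:simpleType name="{name}">',
             '  <xs:restriction base="xs:string">',
             f'    <xs:pattern value="{pattern}"/>']
    if strict and vocab is not None:
        # strict mode: the vocabulary is spelled out as an enumeration
        lines += [f'    <xs:enumeration value="{v}"/>' for v in sorted(vocab)]
    lines += ['  </xs:restriction>', '</xs:simpleType>']
    return "\n".join(lines)

iso639 = {"deu", "eng", "nld"}
strict_xsd = simple_type("languageID", "[a-z]{3}", iso639, strict=True)
lax_xsd = simple_type("languageID", "[a-z]{3}", iso639, strict=False)
```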
+ It can use the reference to the DC to fetch explanations (semantic information) (and translations) from ISOcat, but it is bound to the value range as restricted by the schema.
+ \todoask{Could the application use the vocabulary indication in the DC specification as a default or fallback?}
  \subsection{CMDI - Exploitation side}
  Metadata complying with the CMD-framework is being created by a growing number of institutions by various means: automatic transformation from legacy data, or authoring of new metadata records with the help of one of the metadata editors (TODO: cite: Arbil, NALIDA). The CMD-Infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and to announce the OAI-PMH endpoints. These are harvested daily by a dedicated CLARIN harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}. The harvested data is validated against the schemas\todoin{What about Normalization?} and made available in packaged datasets. These are fetched by the exploitation-side components, which index the metadata records and make them available for searching and browsing.

  \begin{figure*}[!ht]
…
  The first stable and publicly available application providing access to the collected metadata of CMDI has been the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, developed within the CLARIN project. This application operates on the same collection of data as is discussed in this work; however, it employs a faceted search, manually mapping the appropriate metadata fields from the different schemas to 10? fixed facets. The underlying search engine is the widely used full-text search engine Apache Solr\footnote{\url{http://lucene.apache.org/solr/}}. Although this is a very reductionist approach, it is certainly a great starting point, offering a core set of categories together with an initial set of category mappings.
  More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\footnote{\url{http://www.meertens.knaw.nl/cmdi/search/}}. It too is based on Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface.
- \todo{describe indexing and search} \todo{add citation}
+ \todoin{describe indexing and search} \todocite{MI Search Engine}
  And finally, there is the \emph{Metadata Repository}, aimed to collect all the harvested metadata descriptions from CLARIN centers,
…
  The requirements for these repositories: PIDs, CMD, OAI-PMH
- \todo{cite: center-B paper}
+ \todocite{center-B paper}
  \section{Distributed system - federated search}
SMC4LRT/chapters/Introduction.tex (r2697 → r2703)

+ \todocode{install older python (2.5?) to be able to install dot2tex - transforming dot files to nicer pgf formatted graphs}\furl{http://dot2tex.googlecode.com/files/dot2tex-2.8.7.zip}\furl{file:/C:/Users/m/2kb/tex/dot2tex-2.8.7/}
  \subsection{Problem statement}
  While in the Digital Libraries community a consolidation has generally already happened and big federated networks of digital library repositories are set up, in the field of Language Resources and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardizing efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming for one from the wide range of resource types, additionally complicated by the dependence on different schools of thought.
- \todo{Need some number about the disparity in the field, number of institutes, resources, formats.}
+ \todoin{Need some numbers about the disparity in the field: number of institutes, resources, formats.}
  This situation has been identified by the community and multiple standardization initiatives have been undertaken. This process seems to have gained new momentum thanks to the large Research Infrastructure Programmes introduced by the European Commission, aimed at fostering research communities developing large-scale pan-European common infrastructures. One key player in this development is the project CLARIN.
SMC4LRT/chapters/Literature.tex (r2697 → r2703)

- This work is guided by \todo{two (or three? + Infrastructure} main dimensions: the data (broadly, Language Resources and Technology) and the method (Semantic Web technologies). This division is reflected in the following chapter:
+ This work is guided by \todoin{two (or three? + Infrastructure} main dimensions: the data (broadly, Language Resources and Technology) and the method (Semantic Web technologies). This division is reflected in the following chapter:
  \section{(Infrastructure for) Language Resources and Technology}
…
  \item[PHAIDRA] Permanent Hosting, Archiving and Indexing of Digital Resources and Assets, provided by Vienna University\footnote{\url{https://phaidra.univie.ac.at/}}
  \item[eSciDoc] provided by MPG + FIZ Karlsruhe\footnote{\url{https://www.escidoc.org/}}
+ \item[TextGrid] \todocode{install: TextGrid2 - check: TG-search}\furl{http://textgrid.de}
  \item[DRIVER] pan-European infrastructure of Digital Repositories\footnote{\url{http://www.driver-repository.eu/}}
  \item[OpenAIRE] Open Access Infrastructure for Research in Europe\footnote{\url{http://www.openaire.eu/}}
…
  \subsection{FederatedSearch}
+ \todoask{How to relate Federated Search to SMC?}
  \section{Semantic Web}
- \todo{cite TimBL}
+ \todoin{cite TimBL}
  \begin{description}
…
  One more specific recent inspiring work is that of Noah et al. \cite{Noah2010}, developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population, or sophisticated querying and searching.
+ \todoin{check if relevant: http://schema.org/}
  \subsection{Ontology Visualization}
SMC4LRT/chapters/SMC.tex
r2696 r2703 3 3 4 4 5 \section{Data Model }5 \section{Data Model?} 6 6 7 7 Terms ? … … 10 10 RDF 11 11 12 13 12 \subsection{CMD namespace} 14 13 Describe the CMD-format? 15 16 17 \subsection{DCR in SKOS}18 \label{dcr-skos}19 Describe the mapping from DCR into SKOS20 21 DCR recognizes following types of data categories:22 simple, complex: closed, open, constrained, (container)?23 24 \begin{figure*}[!ht]25 \begin{center}26 \includegraphics[width=0.7\textwidth]{images/dc_types}27 \end{center}28 \caption{Data Category types}29 \end{figure*}30 \todo{cite: ISOcat introduction at CLARIN-NL Workshop}31 32 The export to CLAVAS-SKOS only considers/regards closed and simple DCs from the metadata profile are exported.33 A closed DC maps to a concept scheme and a simple DC to a SKOS concept in such a concept scheme.34 However it needs to be yet assessed how useful this approach is. In the metadata profile35 there are many closed DCs with small value domains. How useful are those36 in CLAVAS?37 Originally, the vocabulary repository has been conceived to manage rather large and complex value domains,38 that do not fit easily in the DCR data-model.39 Therefore a threshold seems sensible, where only value domains with more40 then 20, 50 or 100 values are exported.41 42 Open or constrained DCs are not exported as they don't provide anything to a vocabulary. \todo{cite: Menzo2013-03-12 mail}43 However, they can become users of a CLAVAS vocabulary. Actually, providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository.44 45 Currently (before integration of VAS and DCR), the only possibility to constrain the value domain of a data category46 is by the means a XML Schema provides, like a regular expression. 
So for the data category \concept{languageID DC-2482}47 the rule looks like:48 \lstset{language=XML}49 \begin{lstlisting}50 <dcif:conceptualDomain type="constrained">51 <dcif:dataType>string</dcif:dataType>52 <dcif:ruleType>XML Schema regular expression</dcif:ruleType>53 <dcif:rule>[a-z]{3}</dcif:rule>54 </dcif:conceptualDomain>55 \end{lstlisting}56 57 A current proposal by Windhouwer\todo{cite: Menzo2013-03-12 mail} for integration with CLAVAS foresees following extension:58 59 \begin{lstlisting}60 <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>61 \end{lstlisting}62 63 \code{@href} points to the vocabulary. Actually a PID should be used in the context64 of ISOcat, but it is not clear how persistent are the vocabularies. This may pose a problem as part of DC specification may now have a different persistency then the core.65 66 \code{@type} could be \code{closed} or \code{open}. \code{closed}: only values in the vocabulary are67 valid. \code{open}: the values in the vocabulary are hints/preferred values. Basically the DC itself is then open.68 69 This would yield a definition of the conceptualDomain for the data category as follows:70 71 \lstset{language=XML}72 \begin{lstlisting}73 <dcif:conceptualDomain type="constrained">74 <dcif:dataType>string</dcif:dataType>75 <dcif:ruleType>XML Schema regular expression</dcif:ruleType>76 <dcif:rule>[a-z]{3}</dcif:rule>77 </dcif:conceptualDomain>78 <dcif:conceptualDomain type="constrained">79 <dcif:dataType>string</dcif:dataType>80 <dcif:ruleType>CLAVAS vocabulary</dcif:ruleType>81 <dcif:rule>82 <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>83 </dcif:rule>84 </dcif:conceptualDomain>85 \end{lstlisting}86 87 I.e. 
the new rule pointing to the vocabulary would be \emph{added}, so that tools that don't support CLAVAS88 lookup but are capable of XSD/RNG validation, can still use the regular expression based definition.89 90 91 \begin{note}92 93 \noindent94 something similar for the link to an EBNF grammar in SCHEMAcat:95 96 %\begin{lstlisting}97 \begin{verbatim}98 <scr:valueSchema99 xmlns:scr="http://www.isocat.org/ns/scr"100 pid="http://hdl.handle.net/1839/00-SCHM-0000-0000-004A-A"101 type="ISO 14977:1996 EBNF"/>102 \end{verbatim}103 %\end{lstlisting}104 \end{note}105 14 106 15 … … 234 143 \begin{enumerate} 235 144 \item express MDRecords in RDF 236 \item identify related ontologies/vocabularies (category ->vocabulary)145 \item identify related ontologies/vocabularies (category $\rightarrow$ vocabulary) 237 146 \item use a lookup/mapping function (Vocabulary Alignement Service? CATCH-PLUS?) 238 147 239 148 %\fbox{ function lookup: Category x String -> ConceptualDomain} 240 149 \begin{eqnarray*} 241 lookup(Category, Literal) ->ConceptualDomain??150 lookup(Category, Literal) \rightarrow ConceptualDomain?? 242 151 \end{eqnarray*} 243 152 … … 249 158 \subsection{Linked Data - Express dataset in RDF} 250 159 160 161 I do think that ISOcat, CLAVAS, RELcat, an actual language 162 resource all provide a part of the semantic network. 163 164 And if you can express these all in RDF, which we can for almost all of them (maybe 165 except the actual language resource ... unless it has a schema adorned 166 with ISOcat DC references ... \textless insert a SCHEMAcat plug ;-) \textgreater, but for 167 metadata we have that in the CMDI profiles ...) you could load all the 168 relevant parts in a triple store and do your SPARQL/reasoning on it. Well 169 that's where I'm ultimately heading with all these registries related to 170 semantic interoperability ... 
I hope ;-)
\todocite{Menzo}

Partly as a by-product of the entities-mapping effort we will get the metadata description rendered in RDF, linked with
So theoretically we then only need to provide them ``on the web'', to make them a nucleus of the LinkedData-Cloud.

Technical aspects (RDF-store?) / interface (ontology browser?)

\todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/}

defining the Mapping:
\begin{enumerate}
\item convert to RDF: translate: MDRecord $\rightarrow$ [\#mdrecord \#property literal]
\item map: \#mdrecord \#property literal $\rightarrow$ [\#mdrecord \#property \#entity]
\end{enumerate}

\begin{figure*}
\includegraphics[width=1\textwidth]{images/SMC_CMD2LOD}
\caption{The process of transforming the CMD metadata records to an RDF representation}
\label{fig:smc_cmd2lod}
\end{figure*}

-
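The two mapping steps can be sketched as RDF triples in Turtle; all record and property URIs here are hypothetical placeholders, only the Lexvo ISO 639-3 URI scheme is real:

```turtle
@prefix cmd: <http://example.org/cmd#> .       # hypothetical CMD property namespace
@prefix rec: <http://example.org/mdrecord/> .  # hypothetical record namespace

# step 1 -- convert to RDF: the metadata field value becomes a plain literal
rec:42 cmd:languageName "German" .

# step 2 -- map: the literal is resolved to an entity reference
rec:42 cmd:languageName <http://lexvo.org/id/iso639-3/deu> .
```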
SMC4LRT/chapters/System.tex
r2697 r2703

\section{SMC LOD}

\todoin{read: Europeana RDF Store Report}

\todocode{install Jena + fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site}

\todocode{Load data: relcat, clavas, olac-and-dc-providers cmd, lt-world?}

\section{User Interface}
\section{User Interface?}

\subsection{Query Input}

-
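Once the registries (ISOcat, RELcat, CLAVAS) and the CMD records are loaded into a Fuseki dataset, queries can span all of them. A sketch with a hypothetical CMD property namespace (only the SKOS namespace is real):

```sparql
# find records whose language value is linked to a CLAVAS/SKOS concept,
# together with the concept's preferred label
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX cmd:  <http://example.org/cmd#>   # hypothetical CMD property namespace

SELECT ?record ?concept ?label
WHERE {
  ?record  cmd:languageName ?concept .
  ?concept skos:prefLabel   ?label .
}
```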
SMC4LRT/thesis.tex
r2695 r2703

% define custom macros for specific formats or names
\newcommand{\todo}[1]{\textcolor{red}{#1}}
\newcommand{\concept}[1]{\texttt{#1}}
\newcommand{\furl}[1]{\footnote{\url{#1}}}
\newcommand{\ftodo}[1]{\footnote{\todo{#1}}}
\newcommand{\xne}[1]{\textsf{#1}}
\newcommand{\cd}{\textsf{Class Diagram}}
\input{utils}

\setcounter{tocdepth}{2}

…

\appendix

\input{chapters/appendix}

\bibliographystyle{plain}
%\bibliography{references}

-
SMC4LRT/utils.tex
r2695 r2703

\usetikzlibrary{arrows,automata}

\usepackage[textsize=footnotesize, textwidth=1in, colorinlistoftodos=1,
bordercolor=todoborder, linecolor=todoborder, backgroundcolor=todobg]
{todonotes}
% disable
\usepackage[textsize=footnotesize, textwidth=1in, colorinlistoftodos=1, bordercolor=todoborder, linecolor=todoborder, backgroundcolor=todobg]{todonotes}

\newcommand{\todoin}[1]{\todo[inline]{#1}}
\newcommand{\todocite}[1]{\todo[inline,backgroundcolor=cite]{#1}}
\newcommand{\todoask}[1]{\todo[inline,backgroundcolor=ask]{#1}}
\newcommand{\todocode}[1]{\todo[inline,backgroundcolor=code]{#1}} % anything that runs: installing, implementing, data transform
\newcommand{\concept}[1]{\textsf{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\xne}[1]{\textsf{#1}}
\newcommand{\furl}[1]{\footnote{\url{#1}}}
\newcommand{\ftodo}[1]{\footnote{\todo{#1}}}
\newcommand{\ftodo}[1]{\footnote{\todoin{#1}}}

\newenvironment{note}
…

\definecolor{todobg}{rgb}{0.8,0.8,1}
\definecolor{cite}{rgb}{0.8,1,0.8}
\definecolor{ask}{rgb}{1,1,0.8}
\definecolor{code}{rgb}{1,0.8,0.8}
\definecolor{todoborder}{rgb}{0.8,0.4,0.4}

\lstset{
basicstyle=\ttfamily,
basicstyle=\ttfamily\footnotesize,
columns=fullflexible,
showstringspaces=false,
…
}

\lstdefinelanguage{XML}
{
basicstyle=\ttfamily\color{darkblue}\bfseries,
basicstyle=\ttfamily\color{darkblue}\bfseries\footnotesize,
morestring=[b]",
morestring=[s]{>}{<},
morecomment=[s]{<?}{?>},
…
morekeywords={xmlns,version,type}% list your attributes here
}
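A usage sketch of the color-coded todo macros defined above; the note texts are adapted from elsewhere in the thesis draft:

```latex
\todoin{Overview of catalogs, name, since, \#providers, \#resources}
\todocite{Menzo2013-03-12 mail}        % green: missing citation
\todoask{How persistent are the CLAVAS vocabularies?} % yellow: open question
\todocode{install Jena + fuseki}       % red: anything that runs
```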