- Timestamp:
- 09/11/13 18:04:14
- Location:
- SMC4LRT/chapters
- Files:
-
- 10 edited
SMC4LRT/chapters/Conclusion.tex
r3204 r3551 8 8 9 9 More work is needed on consolidation of the actual values in the CMD records. CLARIN has set up a separate task force for data curation, which will have to be an ongoing effort. Also, work is ongoing on enriching the SMC browser with instance data information, allowing to directly see and inspect, which profiles and DCs are effectively being used in the instance data (and how often). 10 11 12 Irrespective of the additional levels - the user wants and has to get to the resource. (not always) 13 to the "original" -
SMC4LRT/chapters/Data.tex
r3140 r3551 13 13 14 14 \subsubsection{CMD Profiles } 15 In the CR 124\footnote{All numbers are as of 2013-06 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev } shows the development of the CR and DCR population over time. 15 In the CR 124\footnote{All numbers are as of 2013-06 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev_profiles} shows the development of the CR and DCR population over time. 16 16 17 17 Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services ) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 419 components and 1587 elements … … 21 21 \begin{table} 22 22 \caption{The development of defined profiles and DCs over time} 23 \label{table:dev } 23 \label{table:dev_profiles} 24 24 \begin{tabular}{ l | r | r | r | r } 25 25 \hline … … 182 182 VIAF - Virtual International Authority File 183 183 184 184 185 Other related relevant activities and initiatives 185 186 … … 213 214 214 215 \section{LRT Metadata Catalogs/Collections} 215 216 \label{sec:lrt-md-catalogs} 216 217 \todoin{Overview of catalogs, name, since, \#providers, \#resources} 217 218 … … 240 241 241 242 \section{Other Metadata Catalogs/Collections} 243 \label{sec:other-md-catalogs} 242 244 243 245 \subsection{(Digital) Libraries} -
SMC4LRT/chapters/Definitions.tex
r3140 r3551 1 1 \chapter{Definitions} 2 3 Meanings of ``mapping'': 4 \begin{itemize} 5 \item transform 6 \item match (schemas) 7 \item overview (browser) 8 \end{itemize} 9 2 \label{ch:def} 10 3 11 4 \section {Namespaces} … … 25 18 \item[CMDI] \textit{Component Metadata Infrastructure} \ref{def:CMDI} 26 19 \item[ERIC] \textit{European Research Infrastructure Consortium} - a legal entity for long-term research infrastructure initiatives 20 \item[DARIAH] \textit{Digital Research Infrastructure for Arts and Humanities} 27 21 \item[DC] data category 28 22 \item[DCR] data category registry \cite{ISO12620:2009} 23 \item[DH] Digital Humanities, also eHumanities 24 \item[LINDAT] czech national infrastructure for LRT\furl{http://lindat.ufal.cuni.cz} 29 25 \item[OLAC] \textit{Open Language Archive Community}\furl{http://www.language-archives.org/}\ref{def:OLAC} 30 26 \item[PID] persistend identifier \todocite{PID} -
SMC4LRT/chapters/Design_SMCinstance.tex
r3240 r3551 1 \chapter{Design - Mapping on instance level} 2 3 4 Linked Data - Express dataset in RDF 5 1 \chapter{System design - mapping on instance level} 2 \label{ch:design-instance} 6 3 \begin{quotation} 7 4 I do think that ISOcat, CLAVAS, RELcat, an actual language … … 16 13 semantic interoperability ... I hope ;-) 17 14 \end{quotation} 18 \todocite{Menzo} 15 \cite{Menzo2013mail} 16 17 18 Linked Data - Express dataset in RDF 19 19 20 20 … … 234 234 235 235 \begin{example} 236 <lr1> dct:title"Language Resource 1"236 <lr1> & dct:title & "Language Resource 1" 237 237 \end{example} 238 238 … … 240 240 241 241 \begin{example} 242 <lr1> isocat:DC-2502"19th century"242 <lr1> & isocat:DC-2502 & "19th century" 243 243 \end{example} 244 244 … … 358 358 \todocode{Load data: relcat, clavas, olac-and-dc-providers cmd, lt-world?} 359 359 360 \section {Full semantic search - concept-based + ontology-driven ?} 361 362 With the new enhanced dataset, as detailed in section \ref{ch:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset. 363 364 Namely to enhance it by employing ontological resources. 365 Mainly this enhancement shall mean, that the user can access the data indirectly by browsing one or multiple ontologies, with which the data will then be linked. These could be for example ontologies of Organizations and Projects. 366 367 360 368 \section{Summary} 361 369 -
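Illustrative note on the Design_SMCinstance.tex changes above: the two example statements (`<lr1> dct:title "Language Resource 1"` and `<lr1> isocat:DC-2502 "19th century"`) are subject, predicate, literal triples. The following is a minimal sketch of how such statements could be serialized as N-Triples lines. It is not taken from the changeset: the function names, the `http://example.org/lr1` subject URI and the `isocat` namespace URI are assumptions made for illustration only.

```python
# Prefix table for the two vocabularies used in the examples.
PREFIXES = {
    "dct": "http://purl.org/dc/terms/",
    # Assumed namespace for ISOcat data categories (illustration only).
    "isocat": "http://www.isocat.org/datcat/",
}


def expand(curie):
    """Expand a prefixed name such as 'dct:title' to a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def to_ntriple(subject_uri, predicate_curie, literal):
    """Serialize one (subject, predicate, plain literal) statement."""
    # Escape backslashes and double quotes inside the literal.
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject_uri}> <{expand(predicate_curie)}> "{escaped}" .'


if __name__ == "__main__":
    lr1 = "http://example.org/lr1"  # hypothetical resource URI
    print(to_ntriple(lr1, "dct:title", "Language Resource 1"))
    print(to_ntriple(lr1, "isocat:DC-2502", "19th century"))
```

The only point of the sketch is that a field bound to a data category yields a predicate URI in the same way a Dublin Core term does; a real conversion would more likely use an RDF library or, as elsewhere in this project, XSLT.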
SMC4LRT/chapters/Design_SMCschema.tex
r3240 r3551 1 1 2 \chapter{ System Design - Mapping on schema level}2 \chapter{Concept-based mapping on schema level -- system design} 3 3 \label{ch:design} 4 4 5 In this chapter, we lay out the functioning of the semantic mapping on schema level, the task the Semantic Mapping Component was originally conceived for within the larger CMD Infrastructure (cf. \ref{def:CMDI}). 6 Semantic interoperability was one of the main concerns addressed by the CMDI and is weaved in tightly in all modules of the infrastructure. The task of the SMC module is to collect information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata formats. This information serves as basis for the concept-based search. 7 8 We start by drawing a global view on the system, introducing its individual components and the dependencies among them. 9 In the next section, the internal data model is presented and explained. In section \ref{sec:cx} the design of the actual main service for resolving crosswalks is described, divided into the interface specification and actual implementation. In section \ref{def:concept_search} we elaborate on a search functionality that builds upon the aforementioned service in terms of appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support this enhanced search possibilities. Finally, in section \ref{smc-browser} a advanced interactive user interface for exploring the CMD data domain is proposed. 10 5 11 \section{System Architecture} 6 12 7 The Semantic Mapping module is based on the DCR and CMD framework and is being developed as a separate service on the side of CLARIN Metadata Service, its primary consuming service, but shall be equally usable by other applications. 8 13 The Semantic Mapping module is based on the DCR and CMD framework (cf. 
section \ref{def:DCR}) 14 and is being developed as a separate service on the side of CLARIN Metadata Service, its primary consuming service, but shall be equally usable by other applications. 15 16 17 \begin{figure*}[!ht] 18 \includegraphics[width=0.8\textwidth]{images/SMC_modules.png} 19 \caption{The component view on the SMC - modules and their inter-dependencies} 20 \label{fig:smc_modules} 21 \end{figure*} 22 23 24 \begin{description} 25 \item[crosswalk service] the main service translating between indexes, detailed in \ref{sec:cx} 26 \item[concept-based query expansion] 27 \item[smc-xsl] set of xslt-stylesheets (governed by a build-file) for pre- and post-processing the data 28 \item[SMC Browser] a web application to explore the CMD data domain consisting of the two modules: \xne{smc-stats} and \xne{smc-graph} 29 \item[smc-stats] a module of the \xne{SMC Browser} providing human-readable statistical summaries of the CMD data domain 30 \item[smc-graph] a module of the \xne{SMC Browser} providing advanced interactive graph-based user interface for exploring the CMD data domain 31 \end{description} 32 33 For broader context see the reference architecture diagram in Figure \ref{fig:ref_arch}. 34 35 \section{Data model - Terms} 36 \label{datamodel-terms} 37 38 \todocode{Terms.xsd} 9 39 10 40 \begin{note} 11 Do we need separate \\section{Data Model}?12 41 Describe the CMD-format? 13 42 \end{note} 14 43 15 \begin{figure*}[!ht] 16 \includegraphics[width=0.8\textwidth]{images/SMC_modules.png} 17 \caption{The process of transforming the CMD metadata records to and RDF representation} 18 \label{fig:smc_modules} 19 \end{figure*} 20 21 For broader context see the reference architecture diagram in Figure \ref{fig:ref_arch}. 
22 23 24 \subsection{Use Cases} 25 26 \begin{itemize} 27 28 \item MD Search employing Semantic Mapping 29 \item MD Search employing Fuzzy Search 30 \end{itemize} 31 32 \section{Crosswalks -- Mapping on schema level} 33 34 merging the pieces of information provided by those, 35 offering them semi-transaprently to the user (or application) on the consumption side. 36 37 a module of the Component Metadata Infrastructure performing semantic mapping on search indexes. This builds the base for query expansion to facilitate semantic search and enhance recall when querying the Metadata Repository. 38 44 \section{Crosswalk service} 45 \label{sec:cx} 46 Crosswalk service offers the functionality, that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. It allows to translate between search indexes. In particular it expresses data category based indexes as equivalent paths to fields in the CMD profiles. This way it builds the base for query expansion enhancing the recall, when searching in the heterogeneous data collection of the joint CLARIN metadata domain. 39 47 40 48 … … 84 92 \subsection{Interface Specification} 85 93 86 In this section, we describe the actual task of the proposed application -- \textbf{mapping indexes to indexes} -- in abstract terms. The returned mappings can be used by other applications to expand or translate the original user query, to match elements in other schemas. 87 \footnote{Though tightly related, mapping of terms and query expansion are to be seen as two separate functions.} 94 In this section, we describe the actual task of the proposed service -- \textbf{mapping indexes to indexes} -- in abstract terms. The returned mappings can be used by other applications to expand or translate the original user query, to match elements in other schemas. 
88 95 % \footnote{This primary usage of SMC for work with user-created query strings explains the need for human-readability of the indices.} 89 96 … … 99 106 \newline 100 107 101 \ texttt{isocat.size $\mapsto$ } \newline102 \verb| [teiHeader.extent, |\newline 103 \ verb| TextCorpusProfile.Number]|108 \begin{example} 109 isocat.size & $\mapsto$ & [teiHeader.extent, TextCorpusProfile.Number] 110 \end{example} 104 111 \newline 105 112 … … 107 114 \newline 108 115 109 \texttt{imdi-corpus.Name $\mapsto$ } \newline 110 \verb| (isocat.resourceName) |$\mapsto$ \newline 111 \verb| TextCorpusProfile.GeneralInfo.Name| 112 \newline 116 \begin{example} 117 imdi-corpus.Name & $\mapsto$ \\ 118 (isocat.resourceName) & $\mapsto$ TextCorpusProfile.GeneralInfo.Name 119 \end{example} 120 \newline 113 121 114 122 (2) \emph{relations between data categories} -- employing also information from the Relation Registry, related (equivalent) data categories are retrieved and subsequently both the input and the related data categories resolved to cmdIndexes: … … 130 138 \verb| Person.Name, Person.FullName]| 131 139 132 \subsection{Initialization} 133 134 First there is an initialization phase, in which the application fetches the information from the source modules (cf. \ref{components}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories: 140 141 \subsection{Implementation} 142 143 At the core of the described module is a set of XSL-stylesheets, governed by a ant-build file and a configuration file holding the information about individual source registries. 144 145 \todoin{generate and reference XSLT-documentation} 146 147 148 \subsubsection{Initialization} 149 150 First, there is an initialization phase, in which the application fetches the information from the source modules (cf. \ref{def:CMD}) and transforms it into the internal Terms format (cf. \ref{datamodel-terms}). 
All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories: 135 151 \newline 136 152 … … 142 158 Finally relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories. 143 159 160 \todocode{example of inverted index} 161 162 \subsubsection{Operation} 163 164 \subsubsection{Computing summaries} 144 165 145 166 \subsection{Extensions} … … 155 176 156 177 \section{Concept-based search} 157 158 Main purpose for the undertaking described in previous two chapters (mapping of concepts and entities) is to enhance the search capabilities of the MDService serving the Metadata/Resources-data. Namely to enhance it by employing ontological resources. 159 Mainly this enhancement shall mean, that the user can access the data indirectly by browsing one or multiple ontologies, 160 with which the data will then be linked. These could be for example ontologies of Organizations and Projects. 161 162 In this section we want to explore, how this shall be accomplished, ie how to bring the enhanced capabilities to the user. 178 \label{def:concept_search} 179 To recall, the main goal of this work is to enhance the search capabilities of the search engines serving the metadata. 180 In this section we want to explore, how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user. 181 182 The emphasis lies on the query language and the corresponding query input interface. 183 163 184 Crucial aspect is the question how to deal with the even greater amount of information in a user-friendly way, ie how to prevent overwhelming, intimidating or frustrating the user. 164 185 186 offering it (the information) semi-transparently to the user (or application) on the consumption side. 
187 165 188 Semi-transparently means, that primarily the semantic mapping shall integrate seamlessly in the interaction with the service, but it shall ``explain'' - offer enough information - on demand, for the user to understand its role and also being able manipulate easily. 189 166 190 167 191 ? … … 181 205 \subsection{SMC as module for Metadata Repository} 182 206 183 (MD)search frameworks: 184 185 \begin{description} 186 \item[Zebra/Z39.50] JZKit 187 \item[Lucene/Solr] 188 \item[eXist] - xml DB 189 \end{description} 190 207 As a concrete proof of concept the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI providing all the metadata records harvested within the CLARIN joint metadata domain. 208 209 Metadata repository is implemented in xquery running within the eXist XML-database as a web application. 210 211 212 \begin{figure*}[!ht] 213 \includegraphics[width=0.8\textwidth]{images/modules_mdrepo-smc.png} 214 \caption{The component view on the SMC - modules and their inter-dependencies} 215 \label{fig:modules-mdrepo} 216 \end{figure*} 191 217 192 218 193 219 \subsection{User Interface?} 194 220 221 195 222 \subsubsection*{Query Input} 223 224 225 \begin{figure*}[!ht] 226 \includegraphics[width=0.8\textwidth]{images/query_input_autocomplete_term.png} 227 \caption{A proposed query input interface offering concepts as search indexes} 228 \label{fig:query_input} 229 \end{figure*} 230 231 Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions. 
196 232 197 233 \subsubsection*{Columns} … … 207 243 \todoin{check Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf} 208 244 245 \section{SMC-Browser} 246 \label{smc-browser} 247 248 Explore the Component Metadata Framework 249 250 As the data set keeps growing both in numbers and in complexity, the call from the CMD community to provide advanced/enhanced ways for its exploration gets stronger. \textit{SMC browser} is one answer to this need. It is a web application, that lets the metadata modeller explore the information about profiles, components, elements and the usage of DCs as an interactive graph. This allows for example to examine the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g., counting how many elements a profiles contains, or in how many profiles a DC is used. 251 252 In CMD, metadata schemas are defined by profiles, that are constructed out of reusable components - collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles. Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should be interpreted \cite{Broeder+2010}. 253 254 Thus, every profile can be expressed as a tree, with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, parent-child relationship being defined by the inclusion (\code{componentA -includes-> componentB}) or referencing (\code{elementA -refersTo-> datcat1}).The reuse of components in multiple profiles and especially also the referencing of the same data categories in multiple CMD elements leads to a blending of the individual profile trees into a graph (acyclic directed, but not necessarily connected). 255 209 256 210 257 \section{Summary} -
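Illustrative note on the Design_SMCschema.tex changes above: the initialization step builds an inverted map from data categories to the CMD index paths that reference them, and the crosswalk service then resolves an index to its equivalent indexes. A minimal Python sketch of that idea follows. It is an assumption-laden illustration, not the module's implementation (which the text describes as a set of XSL stylesheets); the function names are hypothetical and the sample pairs are taken only from the two mapping examples in the diff.

```python
from collections import defaultdict

# (CMD index path, data category) pairs, as they might be extracted from
# profiles in the Component Registry. Sample values come from the examples
# in the text; the data structure itself is an assumption.
PAIRS = [
    ("teiHeader.extent", "isocat.size"),
    ("TextCorpusProfile.Number", "isocat.size"),
    ("imdi-corpus.Name", "isocat.resourceName"),
    ("TextCorpusProfile.GeneralInfo.Name", "isocat.resourceName"),
]


def build_maps(pairs):
    """Return (forward, inverted): path -> datcat and datcat -> [paths]."""
    forward = {}
    inverted = defaultdict(list)
    for cmd_index, datcat in pairs:
        forward[cmd_index] = datcat
        inverted[datcat].append(cmd_index)
    return forward, dict(inverted)


def resolve(index, forward, inverted):
    """Map a data category index or a CMD index to equivalent CMD indexes."""
    if index in inverted:
        # Data category index: return every path that refers to it.
        return list(inverted[index])
    datcat = forward.get(index)
    if datcat is None:
        return []
    # CMD index: go via its data category to the other paths sharing it.
    return [path for path in inverted[datcat] if path != index]


if __name__ == "__main__":
    forward, inverted = build_maps(PAIRS)
    print(resolve("isocat.size", forward, inverted))
    print(resolve("imdi-corpus.Name", forward, inverted))
```

Relations from the Relation Registry could be layered on top by merging the path lists of data categories declared equivalent, which is left out here to keep the sketch short.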
SMC4LRT/chapters/Infrastructure.tex
r3234 r3551 5 5 \section{CLARIN / CMDI} 6 6 \label{def:CLARIN} 7 \label{def:CMDI} 7 8 CLARIN - Common Language Resource and Technology Infrastructure - constituted by over 180 members from round 38 countries. The mission of this project is to 8 9 … … 15 16 16 17 17 As stated before, the SMC is part of CMDI and depends on multiple modules of the infrastructure. Before we describe the interaction itself in chapter \ref{ method}, we introduce in short these modules and the data they provide:18 As stated before, the SMC is part of CMDI and depends on multiple modules of the infrastructure. Before we describe the interaction itself in chapter \ref{ch:design}, we introduce in short these modules and the data they provide: 18 19 19 20 \begin{itemize} … … 29 30 ?MDService 30 31 31 32 \ begin{figure*}[!ht]33 \ includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2}34 \ caption{The diagram depicts the links between pieces of data in the individual registries that serve as basis for semantic mapping}35 \end{figure*} 32 \begin{figure*}[!ht] 33 \includegraphics[width=0.8\textwidth]{images/CMDI_components_old.png} 34 \caption{The diagram (from early CLARIN/CMDI presentations) shows individual modules of the CMDI and their interrelations} 35 \end{figure*} 36 36 37 37 38 \subsection{CMDI - DCR/CR/RR} 38 \label{def: cmdi}39 \label{def: dcr}39 \label{def:CMD} 40 \label{def:DCR} 40 41 41 42 The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories. The resulting commonly agreed controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework. … … 46 47 % \emph{Component Registry} implements the Component Data Model and allows to define, maintain and publish CMD-components and -profiles. 
47 48 49 50 \begin{figure*}[!ht] 51 \includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2} 52 \caption{The diagram depicts the links between pieces of data in the individual registries that serve as basis for semantic mapping} 53 \end{figure*} 54 48 55 The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions. 49 56 However there needs to be an additional means to capture information about relations between data categories. … … 69 76 from the traditional methods of schema matching that try to establish pairwise alignments between schemas only after they were created and published. % -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}. 70 77 71 Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules is described in the following chapter \ref{ method}. 78 Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules is described in the following chapter \ref{ch:design}. 72 79 73 80 \subsection{Vocabulary Service / Reference Data Registry} … … 93 100 94 101 \subsubsection{Vocabulary Service - CLAVAS} 95 As described in previous section (\ref{dcr}), a solid pilar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is -- by design -- not usable for all kinds of reference data. 
In general, it suits well for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow does not lend itself well to maintain ``semi-closed'' concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added). 102 \label{def:CLAVAS} 103 As described in previous section (\ref{def:DCR}), a solid pilar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is -- by design -- not usable for all kinds of reference data. In general, it suits well for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow does not lend itself well to maintain ``semi-closed'' concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added). 96 104 97 105 This shortcoming leads to a need for an additional registry/repository service for this kind of data (controlled vocabularies). Within the CLARIN project mainly the abovementioned taskforce \emph{CLAVAS} is concerned with this challenge. … … 103 111 Currently, the Meertens Institute\furl{http://meertens.knaw.nl/} of the Dutch Royal Academy of Sciences (KNAW), as well as Netherlands Institute for Sound and Vision\furl{http://www.beeldengeluid.nl/} are running an instance of OpenSKOS. 
104 112 As the work on this vocabulary repository started in the context of a cultural heritage program, originally it served vocabularies not directly relevant for the LRT-community \emph{GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven} or \emph{AAT - Art \& Architecture Thesaurus}\furl{http://openskos.org/api/collections}. As part of the process of adaptation to the needs of CLARIN and LRT-community data categories from \xne{ISOcat} have been converted into SKOS-format and ingested into the system. 105 \xne{ CLARIN Centre Vienna} is also running a prototypical instance of the OpenSKOS system with ISOcat data.113 \xne{Austrian Centre for Digital Humanities} is also running a prototypical instance of the OpenSKOS system with ISOcat data. 106 114 107 115 A plan has been developed/adopted to support further vocabularies relevant for the community. … … 114 122 115 123 See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies 116 and \ref{interaction-dcr-skos} for discussion on mapping the information about data categories from ISOcatto \xne{SKOS}.124 and \ref{interaction-dcr-skos} for discussion on mapping the information about data categories from \xne{ISOcat} to \xne{SKOS}. 117 125 118 126 \subsection{Interaction between DCR, VAS and client applications} … … 286 294 With this proposal, ISOcat constrained DCs can refer to a CLAVAS vocabulary as a way to constrain (we stretch this a bit if a vocabulary is 'open', e.g., like organization names where it provides the preferred spelling of known organizations but still has to be possible to add new organization names, not in the vocabulary). 287 295 288 In ISOcat such constraints have the same status as, for example, the data type, which is that ISOcat just provides hints it has no way to enforce this. Look at CMDI where the CMDI elements refer to a ISOcat DC via a concept link but they may have a completely different data type. 
In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (well its in the planning).296 In ISOcat, such constraints have the same status as, for example, the data type, which is that ISOcat just provides hints it has no way to enforce this. Look at CMDI where the CMDI elements refer to a ISOcat DC via a concept link but they may have a completely different data type. In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (well its in the planning). 289 297 290 298 \begin{note} … … 306 314 It can use the reference to the DC to fetch explanations (semantic information) (and translations) from ISOcat, but it is bound to the value range as restricted by the schema. 307 315 308 \todoask{ Could the application use the the vocabulary indication in DC-spec as default or fallback?}309 310 311 312 313 316 \subsection{CMDI - Exploitation side} 314 317 Metadata complying to the CMD-framework is being created by a growing number of institutions by various means, automatic transformation from legacy data, authoring of new metadata records with the help of one of the Metadata-Editors (TODO: cite: Arbil, NALIDA, ). The CMD-Infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints. These are being harvested daily by a dedicated CLARIN harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}. The harvested data is validated against the schemas \todoin{What about Normalization?}. 
and made available in packaged datasets. These are being fetched by the exploitations side components, that index the metadata records and make them available for searching and browsing. … … 328 331 and \emph{Metadata Service} that provides search access to this body of data. As such, Metadata Service is the primary application to use Semantic Mapping, to optionally expand user queries before issuing a search in the Metadata Repository. \cite{Durco2011} 329 332 330 331 333 \section{Content Repositories} 332 334 Metadata is only one aspect of the availability of resources. It is the first step to announce and describe the resources. However it is of little value, if the resources themselves are not equally well accessible. Thus another pillar of the CLARIN infrastructure are Content Repositories - centres to ensure availability of resources. … … 339 341 \section{Distrbuted system - federated search} 340 342 341 Metadata -> harvesting via OAI-PMH 342 but Content search has to be really distributed. 343 344 ? 343 Metadata -> harvesting via OAI-PMH, but Content search has to be really distributed. 344 345 345 \begin{description} 346 346 \item[Z39.50/SRU/SRW/CQL] LoC … … 348 348 \end{description} 349 349 350 350 351 \section{Summary} -
SMC4LRT/chapters/Introduction.tex
r3234 r3551 6 6 \section{Motivation / problem statement} 7 7 8 While in the Digital Libraries community a consolidation generally already happened and global federated networks of digital library repositories are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardization and integration efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types combined with project-specific needs. (chapter \ref{ch:data} analyses the disparity in the data domain)8 While in the Digital Libraries community a consolidation already took place and global federated networks of digital library repositories are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardization and integration efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types combined with project-specific needs. (Chapter \ref{ch:data} analyses the disparity in the data domain.) 9 9 10 This situation has been identified by the community and multiple standardization initiatives had been conducted/undertaken. The process has gained a new momentum thanks to large research infrastructure programmes introduced by the European Commission, aimed at fostering the development of common large-scale international infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies more easily available to scholars, by providing a common harmonized architecture for accessing and working with LRT. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. 
\ref{def:cmdi}) 11 -- a distributed system consisting of multiple interconnected applications aimed at creating and providing metadata for lLRT in a coherent harmonized way. 10 This situation has been identified by the community and numerous standardization initiatives had been undertaken. The process has gained a new momentum thanks to large framework programmes introduced by the European Commission aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies more easily available to scholars by providing a common harmonized architecture for accessing and working with Language Resources and Technology (LRT). One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way. 12 11 13 This work discusses a module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} -- dedicated to overcome or at least ease the semantic interoperability problem stemming from the heterogenity of the resource descriptions, without the reductionist approach of trying to imposeone common description schema for all resources.12 This work discusses one module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} -- dedicated to overcome or at least ease the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of imposing one common description schema for all resources. 
14 13 15 14 \section{Main Goal} 16 15 17 The primary goal of this work is to \emph{\textbf{enhance search functionality} over a large heterogeneous collection of resource descriptions} in the field of Language Resources and Technology (LRT), henceforth referred to as \emph{semantic search}, distinguishing it from the necessary underlying processing, referred to as \emph{semantic mapping}. 16 The primary goal of this work is to \emph{\textbf{enhance search functionality} over a large heterogeneous collection of resource descriptions} in the field of LRT, henceforth referred to as \xne{semantic search}, distinguishing it from the necessary underlying preprocessing, referred to as \xne{semantic mapping}. 18 17 19 18 The -- notoriously polysemic -- term ``mapping'' can have three different meanings within this work, … … 26 25 \end{description} 27 26 28 The work can further be divided along the schema / instance duality/dimension. Figure \ref{fig:master_outline} sketches the goals / conceptual space of this thesis. 27 The work can further be divided along the schema -- instance duality. Figure \ref{fig:master_outline} spans the conceptual space of this work and depicts the relations between individual subgoals.
29 28 30 %\includegraphics[width=\unitlength]{images/master_outline.eps} 29 \begin{figure*}[!ht] 30 \begin{center} 31 %\includegraphics[width=1\textwidth]{images/SMC-export_language_custom_v2c.pdf} 32 \includegraphics{images/master_outline.png} 33 \end{center} 34 \caption{The conceptual space of this work} 31 35 \label{fig:master_outline} 32 \input{images/master_outline.eps_tex} 36 \end{figure*} 37 %\input{images/master_outline.eps_tex} 33 38 34 \subsubsection*{Crosswalks} 35 Goal is not primarily to produce the crosswalks but rather to develop the service serving them. 39 \subsubsection*{Crosswalk service} 40 Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were woven into the underlying meta-model as well as all the modules of the infrastructure. The task of the crosswalk service -- the primary part of the SMC module -- is to collect the relevant information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata schemas that can serve as a basis for concept-based search. 36 41 37 ??? 38 39 While this may seem a rather trivial task, it is not if we consider the heterogeneity and complexity of the dataset, 40 further complicated by the fact, that this shall be community-driven process, without a central authority defining the relations 41 and that there may be even need for different relation sets for different tasks. In fact, a number of modules of the discussed infrastructure are dedicated to overcoming the semantic interoperability problem. 42 Thus, the goal is not primarily to produce the crosswalks but rather to develop the service serving existing ones.
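The idea of a crosswalk described above, i.e. fields from different schemas being related because they point to the same data category, can be illustrated with a minimal Python sketch. This is not the SMC implementation: the field paths and the function names \texttt{build\_concept\_index} and \texttt{crosswalk} are hypothetical and chosen only for illustration; the concept identifiers follow the naming used elsewhere in this work.

```python
from collections import defaultdict

# Hypothetical links from schema fields to data categories (concepts).
# The field paths are invented for illustration; they are NOT taken
# from the Component Registry.
FIELD_TO_CONCEPT = {
    "dublincore.title": "dublincore:title",
    "teiHeader.fileDesc.titleStmt.title": "dublincore:title",
    "collection.GeneralInfo.Name": "isocat:DC-2544",
    "ToolService.GeneralInfo.Name": "isocat:DC-2544",
    "ExperimentProfile.Title": "isocat:DC-2545",
}


def build_concept_index(field_to_concept):
    """Invert the field -> concept links into concept -> set of fields."""
    index = defaultdict(set)
    for field, concept in field_to_concept.items():
        index[concept].add(field)
    return index


def crosswalk(field, field_to_concept, concept_index):
    """Return all other fields that share the data category of `field`."""
    concept = field_to_concept.get(field)
    if concept is None:
        return set()  # field is not linked to any data category
    return concept_index[concept] - {field}


concept_index = build_concept_index(FIELD_TO_CONCEPT)
title_fields = crosswalk("dublincore.title", FIELD_TO_CONCEPT, concept_index)
```

In this sketch only fields linked to the identical data category are related. Relations between distinct but semantically near data categories would need an additional relation set on top of this index, which is left out here.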
42 43 43 44 \subsubsection*{Concept-based query expansion} 44 45 45 Once the crosswalks are available, they can be used to expand/translate user queries, to match related fields across heterogeneous metadata formats, resulting in higher recall. 46 Once the crosswalks are available, they can be used to rewrite user queries (or to generate appropriate search indexes), so that they match related fields across heterogeneous metadata schemas, resulting in higher recall when searching. 46 47 47 48 \paragraph{Example} 48 Confronted with a user query searching in the notorious \concept{dublincore:title} the query has to be expanded to 49 all the semantically near fields (concept cluster), that are however labelled (or even structured) differently in other formats like 49 Given a user query searching in the notorious \concept{dublincore:title}, the query has to be \emph{expanded} to 50 all the semantically near fields (\emph{concept cluster}) that are, however, labelled (or even structured) differently in other schemas, like: 50 51 51 52 \begin{quote} … … 53 54 \end{quote} 54 55 55 but probably not to other fields, using same (sub)strings for the field labels 56 but with different semantics, like: 56 while other fields, labelled with the same (sub)strings but with different semantics, should not be considered: 57 57 58 58 \begin{quote} … … 62 62 \subsubsection*{Semantic interpretation} 63 63 64 The problem of different labels for semantically similar or even identical things is even more so virulent on the level of individual values in the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that yet cannot be explicitly/exhaustively enumerated. This leads to inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.)
Thus, one goal of this work is to map (string) values in selected fields to entities defined in corresponding vocabularies. 64 The problem of different labels for semantically similar or even identical entities is even more virulent on the level of individual values in the fields of the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to propose a mechanism to map (string) values in selected fields to entities defined in corresponding vocabularies. 65 65 66 \subsubsection*{Ontology-driven search /data exploration} 66 \subsubsection*{Ontology-driven data exploration} 67 67 68 By applying semantic web technologies, the user will be given new means of \emph{exploring the dataset} through semantic resources (ontology-driven search/browsing/exploration). 68 Based on the results of the previous parts of the work -- crosswalks and semantic interpretation -- the discussed dataset can be expressed as one big ontology. Consequently, semantic web technologies can be applied, giving the user new means of \emph{exploring the dataset} through semantic resources. 69 69 70 70 \paragraph{Example} 71 Ontology-driven search : Starting from a list of topics the user can browse an ontology to find institutions concerned with those topics and retrieve a union of resources for the resulting cluster.
Thus in general the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external linked-in semantic resources. 71 Ontology-driven search: starting from a list of topics, the user can browse an ontology to find institutions concerned with those topics and retrieve a union of resources for the resulting cluster. Thus, in general, the user can work with the data based on information that is not present in the original dataset, but rather in external interlinked semantic resources. 72 72 73 73 \subsubsection*{Visualization} … … 75 75 76 76 \section{Method} 77 The primary concern of this work is the integrative effort, i.e. bringing together existing pieces (resources, components and methods). We start with examining the existing data and the description of the evolving infrastructure in which this work is embedded. 77 We start by examining the existing data and describing the existing infrastructure in which this work is embedded. 78 78 79 79 Building on this groundwork, in accordance with the first subgoal, we lay out the design of the service for handling crosswalks and concept-based query expansion. We describe the workflow, the central methods and the role of the module relative to other parts of the infrastructure. … … 103 103 \section{Expected Results} 104 104 105 The main result of this work will be the \emph{specification} of the two modules \texttt{Semantic Search} and the underlying \texttt{Semantic Mapping}. 105 The main result of this work will be the \emph{specification} of the two modules \xne{concept-based search} and the underlying \texttt{crosswalk service}. 106 106 This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components 107 107 and the results and findings of the \emph{evaluation}.
… … 110 110 111 111 \begin{description} 112 \item [Specification Semantic Mapping] design of the mapping mechanism 113 \item [Specification Semantic Search] design of the query expansion and integration with search engines 114 \item [Prototype] proof of concept implementation 112 \item [Crosswalk service] specification and basic proof-of-concept implementation of the module 113 \item [Concept-based search] design of the query expansion and integration with search engines 114 \item [Visualization] design of an application for interactive exploration of the concerned dataset 115 115 \item [Evaluation] evaluation results of querying the dataset comparing traditional search and semantic search 116 \item [LinkedData] translation of the source dataset to RDF-based format with links into existing datasets /ontologies/knowledgebases 116 \item [LinkedData] translation of the source dataset to an RDF-based format with links into existing datasets, ontologies and knowledge bases 117 117 \end{description} 118 118 … … 122 122 In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components, modules and services of the infrastructure underlying this work. 123 123 124 The main part of the work is found in chapters \ref{ch:design} , \ref{ch:implementation} and \ref{ch:cmd2rdf} laying out the design of the software module, the proposal how to modell the data in RDF and the possibilities of visualization respectively. 124 The main part of the work is found in chapters \ref{ch:design} and \ref{ch:design-instance}, laying out the design of the software module and the proposal for modelling the data in RDF, respectively. 125 125 126 126 The evaluation and the results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future. -
SMC4LRT/chapters/Literature.tex
r3140 r3551 4 4 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 5 5 6 This work is guided by \todoin{two (or three? + Infrastructure} main dimensions: the data - in broad, Language Resource and Technology and the method -Semantic Web technologies. This division is reflected in the following chapter:6 This work is guided by two main dimensions: the \textbf{data} -- in broad, Language Resource and Technology -- and the \textbf{method} -- Schema matching and Semantic Web technologies. This division is reflected in the following chapter: 7 7 8 8 \section{(Infrastructure for) Language Resources and Technology} … … 14 14 Chapter \ref{ch:data} examines the field of LRT in more detail. 15 15 16 16 17 \subsection{Metadata} 17 A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\footnote{\url{http://www.clarin.eu/cmdi}} \cite{Broeder +2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders2009,Broeder2010}.18 A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\footnote{\url{http://www.clarin.eu/cmdi}} \cite{Broeder2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. 
This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders+2009,Broeder2010}. 18 19 19 Individual components of this infrastructure will be described in more detail in the section \ref{ch: components}.20 Individual components of this infrastructure will be described in more detail in the section \ref{ch:infra}. 20 21 22 A number of solution evolved in the recent years. 23 The first to undertake standardization efforts for the exchange of catalog information were digital libraries. 24 25 Z39.50 as base protocol, Worldcat, mapping/configuration files. 26 These catalogs are further described in the section \ref{sec:other-md-catalogs} 27 28 In the recent years the evolving research infrastructures all identified a common/harmonized search as a crucial component of the system and came up with a number of solutions, however often reduced to collecting metadata, reducing to dublincore 29 and offering a lucene/solr based facetted search. 30 These catalogs are further described in the section \ref{sec:lrt-md-catalogs}. 31 32 Riley and Becker \cite{Riley2010seeing} put the overwhelming amount of existing metadata standards into a systematic comprehensive overview analyzing the use of standards from four aspects: community, domain, function, and purpose. 
21 33 22 34 \subsection{Content Repositories} … … 79 91 \todoin{check if relevant: http://schema.org/} 80 92 93 \subsection{Existing Crosswalk services} 94 95 \url{http://www.oclc.org/developer/services/metadata-crosswalk-service} 96 97 http://semanticweb.org/wiki/VoID 98 http://www.dnb.de/rdf 99 81 100 \subsection{Ontology Visualization} 82 101 -
SMC4LRT/chapters/Results.tex
r3240 r3551 1 \chapter{Evaluation} 2 \label{ch:Evaluation} 3 4 5 \section{Sample Queries} 6 7 candidate Categories: 8 ResourceType, Format 9 Genre, Topic 10 Project, Institution, Person, Publisher 11 12 13 \section{Exploring Data Categories} 14 In the ISOcat DCR 791 DCss are defined in the Metadata thematic profile, out of which 222 were created by the \textit{Athens Core} group. \todoin{need to check, how many of these athens-core data categories are being employed} In the following we describe two show cases -- \textit{Language} and \textit{name} -- in more detail. 1 \chapter{Results and Findings} 2 \label{ch:results} 3 4 In this chapter, the results of the work are presented, divided into two main areas: 5 6 software and data. 7 8 In two sections, we explore the CMD data domain -- the usage of the data categories on the one hand and the integration of existing formats on the other hand. While these two aspects were not directly part of this work, they a) were made possible by the output of this work (SMC-Browser, statistical analysis), b) yield a valuable test case for the usefulness of the work and c) are an indispensable prerequisite for the necessary curation work being carried out by the CMDI community. 9 10 \section{Current status of the infrastructure} 11 Before we get to the results of this work, we briefly summarize the current state of affairs within the CLARIN infrastructure at large to help contextualize the actual results. 12 13 \subsection{CMDI - services} 14 The main services of the infrastructure have been in stable production for the last two years. 15 The Relation Registry is operational as an early prototype. 16 Three instances of OpenSKOS are running, one of them being hosted by ACDH. 17 18 \subsection{CMDI - data} 19 More than 130 profiles are defined. (See \ref{table:dev_profiles} for more details about profiles.) 20 The official CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/} collects data from 69 providers on a daily basis.
21 The collection amounts to over 550,000 records in 64 profiles. 22 23 \subsection{ACDH - the home of SMC} 24 Within CLARIN-AT a new centre has been brought to life, the Austrian Centre for Digital Humanities, which provides depositing and publishing services to the DH community, as well as infrastructural services that are part of the CLARIN Metadata Infrastructure. 25 Figure \ref{fig:acdh_context} sketches the broader context of \xne{acdh} and its different roles. 26 27 28 \section {Software} 29 The specification of the system can be found in the chapters \ref{ch:design} and \ref{ch:design-instance}. 30 31 There is a prototypical implementation for three parts of the system: 32 33 \begin{itemize} 34 \item the crosswalk service as a REST web service 35 \item a module to integrate with a search engine 36 \item a web application that allows advanced interaction with the data set 37 \end{itemize} 38 39 The SMC module is maintained in the CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}. 40 41 Furthermore, the CMD data has been expressed in RDF, as a first important step towards incorporating the dataset in the \emph{Web of Data}. 42 43 \subsection{SMC - crosswalks service} 44 45 The core function of the SMC is implemented as a set of XSL-stylesheets, with auxiliary functionality (like caching or a wrapping web service) being provided by a wrapping application implemented in Java. 46 47 \subsection{SMC - as a module within Metadata Repository} 48 There is also an XQuery implementation that is integrated as a module of SADE/cr-xq -- an eXist-based web application framework for publishing resources, on which the Metadata Repository is running. 49 50 51 \subsection{SMC Browser -- Advanced Interactive User Interface} 52 53 SMC Browser\furl{http://clarin.aac.ac.at/smc-browser} is a web application to explore the complex dataset of the Component Metadata Framework by visualizing its structure as an interactive graph.
54 55 It is implemented on top of the JavaScript library d3; the code is checked into the CLARIN SVN repository. 56 57 The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve the (multilingual) names and descriptions of data categories referenced in the CMD elements, the definitions of the referenced data categories are fetched from DublinCore and ISOcat. 58 59 E.g. starting from 124 profiles, this amounts to a graph with ??? nodes and ??? edges. 60 61 \begin{figure*}[!ht] 62 \includegraphics[width=1\textwidth]{images/screen_SMC-Browser_2013-01-23} 63 \caption{Screenshot of the SMC browser} 64 \end{figure*} 65 66 SMC Browser also features detailed numerical statistics about the dataset as a whole and about individual items (profiles, components, data categories), a set of example results and user documentation. 67 68 In the following section, we make extensive use of the output of this tool to visualize individual aspects of the discussed data set. 69 70 \subsection{SMC LOD} 71 72 73 \section{Exploring the usage of data categories} 74 At the core of the whole SMC (and CMDI) are the data categories as basic conceptual building blocks or anchors. 75 We want to take a closer look at the usage of the data categories in the CMD infrastructure, exemplified by a few very common concepts -- \concept{language}, \concept{name}, \concept{resource type}, \concept{???}. 76 77 In the ISOcat DCR 791 DCs are defined in the Metadata thematic profile, out of which 222 were created by the \textit{Athens Core} group. \todoin{need to check, how many of these athens-core data categories are being employed} 15 78 16 79 \subsection{Language} … … 36 99 37 100 38 \subsection{Name} 101 \subsection{Name / Title} 39 102 There are as many as 72 CMD elements with the label \texttt{Name}, referring to 12 different DCs.
40 103 Again the main DC \textit{resourceName} (\texttt{DC-2544}) being used in 74 profiles together with the semantically near \textit{resourceTitle} (\texttt{DC-2545}) used in 69 profiles offer a good coverage over available data. … … 46 109 \subsection{Subject, Genre, Topic} 47 110 48 \section{Mapping existing Formats} 111 \section{Exploring the integration of existing formats} 112 113 CLARIN set out with the aspiration /yearning to overcome the babylon of metadata formats 114 and its flexible CMD metamodel is specifically designed to integrate existing formats. 115 In this section, we want to elaborate on/analyze the state of integration efforts for 4 major formats: \xne{dublincore/OLAC}, \xne{teiHeader} and \xne{META-SHARE resourceInfo}. 49 116 50 117 \subsection{dublincore / OLAC} 51 118 52 Very widely used format119 Very widely used (because) simple format 53 120 \ref{info:olac-records} 54 121 55 There are 4-5 CMD profiles modelling OLAC/dcmi-terms 56 122 Here the problem of proliferation seems especially virulent. Table \ref{table:dcterms-profiles} lists all the profiles modelling dcterms. 123 As all these profiles are link to the corresponding dublincore data categories, this does not pose a major problem on the exploitation side, however the cluttering of the component registry with structurally identical or almost identical profiles needs to be questioned within the community. 
124 125 \begin{figure*}[!ht] 126 \begin{center} 127 \includegraphics[width=0.5\textwidth]{images/dcmiterms-profiles.png} 128 \end{center} 129 \caption{The meanwhile four DCMI profiles with identical conceptual linking} 130 \label{fig:dcmi-profiles} 131 \end{figure*} 132 133 134 \begin{table} 135 \caption{Profiles modelling dublincore terms} 136 \label{table:dcterms-profiles} 137 \begin{tabular}{ l | l | l | r | r } 138 \hline 139 profile name & created & creator & count & instances \\ 140 \hline 141 component-dc-terms-modular & 2010-04-21 & CMDI-team & 15 / 15 / 15 \\ 142 component-dc-terms & 2010-04-21 & CMDI-team & 0 / 15 / 15 \\ 143 DcmiTerms & 2010-10-28 & Dieter Van Uytvanck & 0 / 55 / 55 & \\ 144 OLAC-DcmiTerms & 2010-10-28 & Dieter Van Uytvanck & 0 / 55 / 55 & \\ 145 OLAC-DcmiTerms\footnote{optional DANS-DC-metadata component} & 2013-02-12 & Menzo Windhouwer & 1 / 71 / 62 & \\ 146 DC-UBU & 2013-05-29& Utrecht University Library & 0 / 15 / 15 & \\ 147 OLAC-DcmiTerms-ref & 2013-06-24 & fankhauser@ids-mannheim.de & 0 / 55 / 55 & \\ 148 \hline 149 \end{tabular} 150 \end{table} 151 152 Additionally, there is a number of profiles with concept links to dublincore terms, 153 Some use all of the dublincore elements or terms as one component within a larger profile, 154 one example being the \xne{data} profile created by the Czech initiative LINDAT modells the minimal obligatory set of META-SHARE \xne{resourceInfo}) combined with a simple dublincore record (see also subsection about META-SHARE below). 155 Other profiles refer only to some data categories. Most often used: \concept{Title} (used in 33 profiles) and \concept{Creator} (in 29 profiles). 
156 Profiles that make more frequent use of the dublincore terms: 157 158 \begin{itemize} 159 \item EastRepublican (8) 160 \item HZSKCorpus (17) 161 \item teiHeader (8) 162 \item ToolService (15) 163 \item OralHistoryInterviewDANS (15) 164 \end{itemize} 165 166 \begin{figure*}[!ht] 167 \begin{center} 168 \includegraphics[width=0.8\textwidth]{images/profiles_using_dcmiterms.png} 169 \end{center} 170 \caption{Profiles referring to at least some of the dublincore data categories/terms} 171 \label{fig:profiles-using-dcmiterms} 172 \end{figure*} 57 173 58 174 … … 65 181 The widespread use of TEI for encoding textual resources brings about a strong interest of multiple research teams of the CLARIN community to integrate TEI with CMDI. There was a first attempt already in 2010, modelling the recommended \xne{teiHeader}\furl{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html\#HD7}, encoding \xne{fileDesc} and \xne{profileDesc} components, leaving out \xne{encodingDesc} and \xne{revisionDesc}. The leaf elements were bound to the most prominent data categories, making it a mixture of both dublincore and isocat. 66 182 67 The large research project \xne{Deutsches Text Archiv}\furl{http://deutschestextarchiv.de/}\todocite{DTA}, digitizing a hoist of historical german texts from the period 1650 - 1900 also uses TEI to encode the material and consequently the teiHeader to hold the metadata information.68 \todoin{Why a separate cmd-profile} 69 70 \xne{Nederlab} is another large-scale project concerned with \todoin{dutch? historic texts}, starting 2013 in Netherlands\todocite{Nederlab}. 
Within this projectanother set of CMD profiles was created, however reusing existing components.71 As seen in figure \ref{fig:teiHeade er_DBNL}, components \xne{fileDesc} and \xne{profileDesc} were reused, while the components \xne{encodingDesc} and \xne{revisionDesc}, left out in the original profile, were added.72 73 Another approach was applied within the context of other CLARIN-NL projects , \todocite{Windhouwer, 2012} generated, based on an ODD-file, a data category for every element of the teiHeader (135 datcats) creating a dedicated data category selection: \xne{TEI Header (2.1.0)}. In a subsequent step, an enriched schema was generated, that remodells the original teiHeader-schema, but with the individual elements being annotated with the new data categories (\code{dcr:datcat}-attribute). This schema is now maintained in the SCHEMAcat (cf. \ref{ch:components}. The next step would be to create again a new profile, but with all the components and elements in it bound to the corresponding data categories, moving the semantic linking into the relation registry, where appropriate relations could be defined between the data categories derived from TEI and the \xne{isocat} and/or \xne{dublincore} DCs.183 The large research project \xne{Deutsches Textarchiv}\furl{http://deutschestextarchiv.de/}\cite{Geyken2011deutsches}, digitizing a hoist of historical german texts from the period 1650 - 1900 also uses TEI to encode the material and consequently the teiHeader to hold the metadata information. Part of the project is also to integrate the data and metadata with the CLARIN infrastructure, meaning CMD records need to be generated for the resources. For this the team generated a completely new profile (as yet private) closely modelling the version of the teiHeader\furl{http://www.deutschestextarchiv.de/doku/basisformat_header} used in the project. 
184 Regarding the question, why another teiHeader-based profile was generated not reusing the existing one, according to a personal note by a member of the project team and author of the profile, Axel Herold\cite{Herold2013} the profile was custom made for this particular project and it seemed undesirable to create a generalised TEI header profile. 185 186 \xne{Nederlab} is another large-scale project aiming processing historic Dutch newspaper articles into a platform for search and analysis, starting 2013 in Netherlands\furl{http://www.nederlab.nl}. Within this project, the metadata is also encoded in a \concept{teiHeader} and the data shall be integrated within CLARIN. Here, another set of CMD profiles was created, however reusing existing components. 187 As seen in figure \ref{fig:teiHeader_DBNL}, components \xne{fileDesc} and \xne{profileDesc} were reused, while the components \xne{encodingDesc} and \xne{revisionDesc}, left out in the original profile, were added. 188 189 Another approach was applied within the context of other CLARIN-NL projects\cite{Menzo2013-05tei}. Based on an ODD-file, a data category for every element of the teiHeader (135 datcats) was generated. In a subsequent step, an enriched schema was generated, that remodells the original teiHeader-schema, but with the individual elements being annotated with the new data categories (\code{dcr:datcat}-attribute). This schema is now maintained in the SCHEMAcat (cf. \ref{ch:infra}). The next step would be to create again a new profile, but with all the components and elements in it bound to the corresponding data categories, moving the semantic linking into the relation registry, where appropriate relations could be defined between the data categories derived from TEI and the \xne{isocat} and/or \xne{dublincore} DCs. 
74 190 This yields a more complex, but also a more systematic and flexible setup, with a clean separation/boundary/interface of the semantic space of TEI and the possibility to map the TEI elements (via their data categories) to multiple and/or different data categories according to the specific needs of a project or research question. 75 191 … … 87 203 \begin{tabular}{ l | r | l | r | r | r} 88 204 \hline 89 pro ject, author & created & profile name & comp elem datcats& instances \\90 \hline 91 Deutsches Text Archiv & 2012 & teiHeader & 56/82/10 & 857 \\92 ICLTT, Durco & 2010 & teiHeader & 16/35/13 & 467 \\93 Leipzig Corpora, Eckart & 2012 & TEIDocumentDescription& 16/35/13 & ? \\94 Nederlab, Zhang & 2013 & DBNL\_Tekst & 20/38,15 & ?\\95 & & DBNL\_Tekst\_Onzelfstandig & 20/47/21 & ?\\205 profile name & created & creator & count & instances \\ 206 \hline 207 teiHeader & 2010 & ICLTT, Durco & 16/35/13 & 467 \\ 208 teiHeader & 2012 & Deutsches Text Archiv & 56/82/10 & 857 \\ 209 TEIDocumentDescription & 2012 & Leipzig Corpora, Eckart & 16/35/13 & ? \\ 210 DBNL\_Tekst & 2013 & Nederlab, Zhang & 20/38,15 & \textgreater 37 Mio.\footnote{There shall be a metadata record for every article.} \\ 211 DBNL\_Tekst\_Onzelfstandig & & & 20/47/21 & \\ 96 212 \hline 97 213 \end{tabular} 98 214 \end{table} 99 215 100 \todoin{DBNL\_Tekst\_Onzelfstandig - how many instances?}101 102 216 DBNL\_Tekst clarin.eu:cr1:p\_1361876010678, 103 217 clarin.eu:cr1:p 1366279029218 (private) … … 108 222 META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components. 109 223 %In cooperation between metadata teams from CLARIN and META-SHARE 110 The model has been expressed as 4 CMD profiles for distinct resource types sharing most of the components. 
The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 419 components and 1587 elements (when expanded). Although most of the elements are optional 111 112 resourceInfo 419 1587 72 790 797 50.22 % 113 \todoin{how many distinct components/elements} 114 This? shows nicely the trade-off between the two different approaches between CMD and META-SHARE: many custom schemas or one very large. 115 116 In a parallel effort, LINDAT, the czech national infrastructure initiative with ties to both CLARIN and META-SHARE, created a CMD profile modelling the minimal obligatory set of META-SHARE. combined with dublincore. 117 So the information is partly duplicated, but with the advantage, that a minimal information is conveyed in the widely understood format, retaining the expressivity of the feature-rich schema 118 119 resourceInfo 65 92 21 82 10 10.87 % 120 224 225 \begin{figure*}[!ht] 226 \begin{center} 227 \includegraphics[width=0.5\textwidth]{images/SMC-resourceInfo.png} 228 \end{center} 229 \caption{profile by LINDAT combining META-SHARE \xne{resourceInfo} component with dublincore elements } 230 \label{fig:resource_info_5} 231 \end{figure*} 232 233 \begin{table} 234 \caption{Profiles modelling resourceInfo} 235 \label{table:resourceinfo-profiles} 236 \begin{tabular}{ l | l | l | r | r } 237 \hline 238 profile name & created & creator & count & instances \\ 239 \hline 240 resourceInfo (minimal) & 2013-02-13 & LINDAT.CZ & 34 / 41 / 21 \\ 241 resourceInfo (lexical) & 2013-06-02 & P. Labropoulou & 86 / 226 / 57 \\ 242 resourceInfo (tools) & 2013-06-02 & P. Labropoulou & 61 / 176 / 52 \\ 243 resourceInfo (language) & 2013-06-02 & P. Labropoulou & 89 / 228 / 54 \\ 244 resourceInfo (corpus) & 2013-06-02 & P. 
Labropoulou & 117 / 337 / 72 \\ 245 \hline 246 \end{tabular} 247 \end{table} 248 249 The model has been expressed as four CMD profiles, one for each distinct resource type, with all four sharing most of their components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), so a specific instance will never use all the possible elements. 250 251 In a parallel effort, LINDAT, the Czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\xne{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE (\xne{resourceInfo}), combined, however, with a simple dublincore record. 252 This way, the information gets partly duplicated, but with the advantage that a minimal set of information is conveyed in a widely understood format, while retaining the expressivity of the feature-rich schema. 121 253 122 254 \begin{figure*}[!ht] … … 137 269 138 270 139 \section{Summary} 140 141 142 143 \chapter{Results} 144 \label{ch:results} 145 146 147 \section { Software module} 148 149 The core function of the SMC is implemented as a set of XSL-stylesheets, with auxiliary functionality (like caching or a wrapping web service) being provided by a wrapping application implemented in Java. There is also a plan to provide an XQuery implementation. The SMC module is being maintained in the CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.
150 151 152 \subsection{SMC Browser -- Advanced Interactive User Interface} 153 154 Explore the Component Metadata Framework 155 156 In CMD, metadata schemas are defined by profiles, that are constructed out of reusable components - collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles. Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should be interpreted (Broeder et al., 2010). 157 158 Thus, every profile can be expressed as a tree, with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, parent-child relationship being defined by the inclusion (componentA -includes-> componentB) or referencing (elementA -refersTo-> datcat1).The reuse of components in multiple profiles and especially also the referencing of the same data categories in multiple CMD elements leads to a blending of the individual profile trees into a graph (acyclic directed, but not necessarily connected). 159 160 SMC Browser visualizes this graph structure in an interactive fashion. You can have a look at the examples for inspiration. 161 162 It is implemented on top of wonderful js-library d3, the code checked in clarin-svn (and needs refactoring). More technical documentation follows soon. 163 164 The graph is constructed from all profiles defined in the Component Registry. To resolve name and description of data categories referenced in the CMD elements definitions of all (public) data categories from DublinCore and ISOcat (from the Metadata Profile [RDF] - retrieving takes some time!) are fetched. However only data categories used in CMD will get part of the graph. Here is a quantitative summary of the dataset. 
272 \section{Evaluation} 273 \label{evaluation} 274 275 Sample Queries: 276 277 candidate Categories: 278 ResourceType, Format 279 Genre, Topic 280 Project, Institution, Person, Publisher 281 282 283 284 \subsection{Use Cases} 285 286 \begin{itemize} 287 288 \item MD Search employing Semantic Mapping 289 \item MD Search employing Fuzzy Search 290 \end{itemize} 165 291 166 292 … … 173 299 \section{Summary} 174 300 175 176 \begin{figure*}[!ht] 177 \includegraphics[width=1\textwidth]{images/screen_SMC-Browser_2013-01-23} 178 \caption{Screenshot of the SMC browser} 179 \end{figure*} 180 181 301 The direct comparison of the CMD approach, a metamodel that allows custom profiles with shared semantics to be generated, with the more traditional approach of META-SHARE, which tries to create one schema to fit all, nicely shows the trade-off: many custom schemas or one very large one. 302 -
SMC4LRT/chapters/appendix.tex
r3240 r3551 9 9 \includegraphics[width=1\textwidth]{images/DCR_data_model.jpg} 10 10 \end{center} 11 \caption{DCR data model} 11 \caption{DCIF -- the data model for the Data Category Registry as defined by the ISO Standard ISO12620:2009 \cite{ISO12620:2009}} 12 12 \label{fig:DCR_data_model} 13 13 \end{figure*} 14 \todocite{DCR data model} 15 14 16 15 \begin{figure*}[!ht] … … 21 20 \label{fig:ref_arch} 22 21 \end{figure*} 22 23 \begin{figure*}[!ht] 24 \begin{center} 25 \includegraphics[width=1\textwidth]{images/acdh-diagram_300dpi_rotated.png} 26 \end{center} 27 \caption{Austrian Centre for Digital Humanities - the home of SMC - in context} 28 \label{fig:acdh_context} 29 \end{figure*} 30 31 \section {SMC Reports} 32 \label{sec:reports} 33 34 SMC Reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain based on the visual and numerical output from the SMC Browser (see \ref{smc-browser}). 35 36 37 \input{chapters/examples_cleaned}