Changeset 3551 for SMC4LRT


Timestamp:
09/11/13 18:04:14 (11 years ago)
Author:
vronk
Message:

intermediate version - ongoing work on introduction

Location:
SMC4LRT/chapters
Files:
10 edited

  • SMC4LRT/chapters/Conclusion.tex

    r3204 r3551  
    88
    99More work is needed on consolidating the actual values in the CMD records. CLARIN has set up a separate task force for data curation, which will have to be an ongoing effort. Work is also ongoing on enriching the SMC browser with instance data information, making it possible to see and inspect directly which profiles and DCs are actually being used in the instance data (and how often).
     10
     11
     12Irrespective of the additional levels - the user wants and has to get to the resource. (not always)
     13to the "original"
  • SMC4LRT/chapters/Data.tex

    r3140 r3551  
    1313
    1414\subsubsection{CMD Profiles }
    15 In the CR 124\footnote{All numbers are as of 2013-06 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time.
     15In the CR 124\footnote{All numbers are as of 2013-06 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev_profiles} shows the development of the CR and DCR population over time.
    1616
    1717Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles demonstrate the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles also differ considerably in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The largest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 419 components and 1587 elements
     
    2121\begin{table}
    2222\caption{The development of defined profiles and DCs over time}
    23 \label{table:dev}
     23\label{table:dev_profiles}
    2424  \begin{tabular}{ l | r | r | r | r }
    2525    \hline
     
    182182VIAF - Virtual International Authority File
    183183
     184
    184185Other related relevant activities and initiatives
    185186
     
    213214
    214215\section{LRT Metadata Catalogs/Collections}
    215 
     216\label{sec:lrt-md-catalogs}
    216217\todoin{Overview of catalogs, name, since, \#providers, \#resources}
    217218
     
    240241
    241242\section{Other Metadata Catalogs/Collections}
     243\label{sec:other-md-catalogs}
    242244
    243245\subsection{(Digital) Libraries}
  • SMC4LRT/chapters/Definitions.tex

    r3140 r3551  
    11\chapter{Definitions}
    2 
    3 Meanings of ``mapping'':
    4 \begin{itemize}
    5 \item transform 
    6 \item match (schemas)
    7 \item  overview (browser)
    8 \end{itemize} 
    9 
     2\label{ch:def}
    103
    114\section {Namespaces}
     
    2518\item[CMDI] \textit{Component Metadata Infrastructure} \ref{def:CMDI}
    2619\item[ERIC] \textit{European Research Infrastructure  Consortium} - a legal entity for long-term research infrastructure initiatives
     20\item[DARIAH] \textit{Digital Research Infrastructure for Arts and Humanities}
    2721\item[DC] data category
    2822\item[DCR] data category registry \cite{ISO12620:2009}
     23\item[DH] Digital Humanities, also eHumanities
     24\item[LINDAT] Czech national infrastructure for LRT\furl{http://lindat.ufal.cuni.cz}
    2925\item[OLAC] \textit{Open Language Archive Community}\furl{http://www.language-archives.org/}\ref{def:OLAC}
    3026\item[PID] persistent identifier \todocite{PID}
  • SMC4LRT/chapters/Design_SMCinstance.tex

    r3240 r3551  
    1 \chapter{Design - Mapping on instance level}
    2 
    3 
    4 Linked Data - Express dataset in RDF
    5 
     1\chapter{System design - mapping on instance level}
     2\label{ch:design-instance}
    63\begin{quotation}
    74I do think that ISOcat, CLAVAS, RELcat, an actual language
     
    1613semantic interoperability ... I hope ;-)
    1714\end{quotation}
    18 \todocite{Menzo}
     15\cite{Menzo2013mail}
     16
     17
     18Linked Data - Express dataset in RDF
    1919
    2020
     
    234234
    235235\begin{example}
    236 <lr1> dct:title "Language Resource 1"
     236<lr1> & dct:title & "Language Resource 1"
    237237\end{example}
    238238
     
    240240
    241241\begin{example}
    242 <lr1> isocat:DC-2502 "19th century"
     242<lr1> & isocat:DC-2502 & "19th century"
    243243\end{example}
    244244
     
    358358\todocode{Load data: relcat, clavas, olac-and-dc-providers cmd, lt-world?}
    359359
     360\section {Full semantic search - concept-based + ontology-driven ?}
     361
     362With the new enhanced dataset, as detailed in section \ref{ch:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset.
     363
     364Namely, the search is to be enhanced by employing ontological resources.
     365This enhancement primarily means that the user can access the data indirectly by browsing one or multiple ontologies with which the data will then be linked. These could be, for example, ontologies of organizations and projects.
     366
     367
    360368\section{Summary}
    361369
  • SMC4LRT/chapters/Design_SMCschema.tex

    r3240 r3551  
    11
    2 \chapter{System Design - Mapping on schema level}
     2\chapter{Concept-based mapping on schema level -- system design}
    33\label{ch:design}
    44
     5In this chapter, we lay out the functioning of the semantic mapping on schema level, the task the Semantic Mapping Component was originally conceived for within the larger CMD Infrastructure (cf. \ref{def:CMDI}).
     6Semantic interoperability was one of the main concerns addressed by the CMDI and is woven tightly into all modules of the infrastructure. The task of the SMC module is to collect information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata formats. This information serves as the basis for the concept-based search.
     7
     8We start with a global view of the system, introducing its individual components and the dependencies among them.
     9In the next section, the internal data model is presented and explained. In section \ref{sec:cx} the design of the actual main service for resolving crosswalks is described, divided into the interface specification and the actual implementation. In section \ref{def:concept_search} we elaborate on a search functionality that builds upon the aforementioned service in terms of an appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support these enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.
     10
    511\section{System Architecture}
    612
    7 The Semantic Mapping module is based on the DCR and CMD framework and is being developed as a separate service on the side of CLARIN  Metadata Service, its primary consuming service, but shall be equally usable by other applications.
    8 
     13The Semantic Mapping module is based on the DCR and CMD framework (cf. section \ref{def:DCR})
     14and is being developed as a separate service alongside the CLARIN Metadata Service, its primary consuming service, but shall be equally usable by other applications.
     15
     16
     17\begin{figure*}[!ht]
     18\includegraphics[width=0.8\textwidth]{images/SMC_modules.png}
     19\caption{The component view on the SMC - modules and their inter-dependencies}
     20\label{fig:smc_modules}
     21\end{figure*}
     22
     23
     24\begin{description}
     25\item[crosswalk service] the main service translating between indexes, detailed in \ref{sec:cx}
     26\item[concept-based query expansion]
     27\item[smc-xsl] a set of XSLT stylesheets (governed by a build file) for pre- and post-processing the data
     28\item[SMC Browser] a web application to explore the CMD data domain consisting of the two modules: \xne{smc-stats} and \xne{smc-graph}
     29\item[smc-stats] a module of the \xne{SMC Browser} providing human-readable statistical summaries of the CMD data domain
     30\item[smc-graph] a module of the \xne{SMC Browser} providing an advanced interactive graph-based user interface for exploring the CMD data domain
     31\end{description}
     32
     33For broader context see the reference architecture diagram in Figure \ref{fig:ref_arch}.
     34
     35\section{Data model - Terms}
     36\label{datamodel-terms}
     37
     38\todocode{Terms.xsd}
    939
    1040\begin{note}
    11 Do we need separate \\section{Data Model}?
    1241Describe the CMD-format?
    1342\end{note}
    1443
    15 \begin{figure*}[!ht]
    16 \includegraphics[width=0.8\textwidth]{images/SMC_modules.png}
    17 \caption{The process of transforming the CMD metadata records to and RDF representation}
    18 \label{fig:smc_modules}
    19 \end{figure*}
    20 
    21 For broader context see the reference architecture diagram in Figure \ref{fig:ref_arch}.
    22 
    23 
    24 \subsection{Use Cases}
    25 
    26 \begin{itemize}
    27 
    28 \item MD Search employing Semantic Mapping
    29 \item MD Search employing Fuzzy Search
    30 \end{itemize}
    31 
    32 \section{Crosswalks -- Mapping on schema level}
    33 
    34 merging the pieces of information provided by those,
    35 offering them semi-transaprently to the user (or application) on the consumption side.
    36 
    37 a module of the Component Metadata Infrastructure performing semantic mapping on search indexes. This  builds the base for query expansion to facilitate semantic search and enhance recall when querying the Metadata Repository.
    38 
     44\section{Crosswalk service}
     45\label{sec:cx}
     46The crosswalk service offers the functionality that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. It allows translating between search indexes. In particular, it expresses data-category-based indexes as equivalent paths to fields in the CMD profiles. In this way it provides the basis for query expansion, enhancing recall when searching in the heterogeneous data collection of the joint CLARIN metadata domain.
    3947
    4048
     
    8492\subsection{Interface Specification}
    8593
    86 In this section, we describe the actual task of the proposed application -- \textbf{mapping indexes to indexes} -- in abstract terms. The returned mappings can be used by other applications to expand or translate the original user query, to match elements in other schemas.
    87 \footnote{Though tightly related, mapping of terms and query expansion are to be seen as two separate functions.}
     94In this section, we describe the actual task of the proposed service -- \textbf{mapping indexes to indexes} -- in abstract terms. The returned mappings can be used by other applications to expand or translate the original user query, to match elements in other schemas.
    8895% \footnote{This primary usage of SMC for work with user-created query strings explains the need for human-readability of the indices.}
    8996
     
    99106\newline
    100107
    101 \texttt{isocat.size $\mapsto$ } \newline
    102 \verb|   [teiHeader.extent, |\newline
    103 \verb|    TextCorpusProfile.Number]|
     108\begin{example}
     109isocat.size     & $\mapsto$ & [teiHeader.extent, TextCorpusProfile.Number]
     110\end{example}
    104111\newline
    105112
     
    107114\newline
    108115
    109 \texttt{imdi-corpus.Name   $\mapsto$ } \newline
    110 \verb|   (isocat.resourceName) |$\mapsto$  \newline
    111 \verb|   TextCorpusProfile.GeneralInfo.Name|
    112 \newline
     116\begin{example}
     117imdi-corpus.Name & $\mapsto$ \\
     118(isocat.resourceName) & $\mapsto$ TextCorpusProfile.GeneralInfo.Name
     119\end{example}   
     120\newline
    113121
    114122(2) \emph{relations between data categories} -- employing also information from the Relation Registry, related (equivalent) data categories are retrieved and subsequently both the input and the related data categories resolved to cmdIndexes:
     
    130138\verb|     Person.Name, Person.FullName]|
    131139
    132 \subsection{Initialization}
    133 
    134 First there is an initialization phase, in which the application fetches the information from the source modules (cf. \ref{components}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories:
     140
     141\subsection{Implementation}
     142
     143At the core of the described module is a set of XSL stylesheets, governed by an Ant build file and a configuration file holding the information about individual source registries.
     144
     145\todoin{generate and reference XSLT-documentation}
     146
     147
     148\subsubsection{Initialization}
     149
     150First, there is an initialization phase, in which the application fetches the information from the source modules (cf. \ref{def:CMD}) and transforms it into the internal Terms format (cf. \ref{datamodel-terms}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories:
    135151\newline
    136152
     
    142158Finally, relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories.
    143159
     160\todocode{example of inverted index}
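The initialization described above can be sketched roughly as follows. This is only a minimal illustration, not the actual XSLT-based implementation: the function names, the in-memory input format and the relation set (including \texttt{dc.extent} and \texttt{OLAC-DcmiTerms.extent}) are assumptions made for the example; only the \texttt{isocat.size} and \texttt{isocat.resourceName} mappings are taken from the interface specification.

```python
from collections import defaultdict

# Hypothetical stand-in for the data read from the Component Registry:
# (path to a CMD element, data category the element refers to).
PROFILE_ELEMENTS = [
    ("teiHeader.extent", "isocat.size"),
    ("TextCorpusProfile.Number", "isocat.size"),
    ("TextCorpusProfile.GeneralInfo.Name", "isocat.resourceName"),
    ("OLAC-DcmiTerms.extent", "dc.extent"),  # assumed, for illustration only
]

# Hypothetical relation set as it might be fetched from the Relation Registry.
RELATION_SETS = [{"isocat.size", "dc.extent"}]


def build_inverted_map(elements):
    """Map every data category to the CMD element paths that refer to it."""
    inverted = defaultdict(list)
    for path, datcat in elements:
        inverted[datcat].append(path)
    return dict(inverted)


def resolve(datcat, inverted, relation_sets=()):
    """Resolve a data category (and its related categories) to CMD paths."""
    related = {datcat}
    for relation_set in relation_sets:
        if datcat in relation_set:
            related |= set(relation_set)
    paths = []
    for dc in sorted(related):  # sorted only to keep the output deterministic
        paths.extend(inverted.get(dc, []))
    return paths


inverted = build_inverted_map(PROFILE_ELEMENTS)
print(resolve("isocat.size", inverted))
print(resolve("isocat.size", inverted, RELATION_SETS))
```

Keeping the relation sets separate from the inverted map means that the plain data category lookup and the relation-based expansion stay two distinct steps, as in the interface specification.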
     161
     162\subsubsection{Operation}
     163
     164\subsubsection{Computing summaries}
    144165
    145166\subsection{Extensions}
     
    155176
    156177\section{Concept-based search}
    157 
    158 Main purpose for the undertaking described in previous two chapters (mapping of concepts and entities) is to enhance the search capabilities of the MDService serving the Metadata/Resources-data. Namely to enhance it by employing ontological resources.
    159 Mainly this enhancement shall mean, that the user can access the data indirectly by browsing one or multiple  ontologies,
    160 with which the data will then be linked. These could be for example ontologies of Organizations and Projects.
    161 
    162 In this section we want to explore, how this shall be accomplished, ie how to bring the enhanced capabilities to the user.
     178\label{def:concept_search}
     179To recall, the main goal of this work is to enhance the search capabilities of the search engines serving the metadata.
     180In this section we want to explore how this can be accomplished, i.e. how to bring the enhanced capabilities to the user.
     181
     182The emphasis lies on the query language and the corresponding query input interface.
     183
    163184A crucial aspect is the question of how to deal with the even greater amount of information in a user-friendly way, i.e. how to avoid overwhelming, intimidating or frustrating the user.
    164185
     186offering it (the information) semi-transparently to the user (or application) on the consumption side.
     187
    165188Semi-transparently means that the semantic mapping shall primarily integrate seamlessly into the interaction with the service, but it shall ``explain'' itself on demand, i.e. offer enough information for the user to understand its role and to manipulate it easily.
     189
    166190
    167191?
     
    181205\subsection{SMC as module for Metadata Repository}
    182206
    183 (MD)search frameworks:
    184 
    185 \begin{description}
    186 \item[Zebra/Z39.50] JZKit
    187 \item[Lucene/Solr]
    188 \item[eXist] - xml DB
    189 \end{description}
    190 
     207As a concrete proof of concept, the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI providing all the metadata records harvested within the CLARIN joint metadata domain.
     208
     209The Metadata Repository is implemented in XQuery, running within the eXist XML database as a web application.
     210
     211
     212\begin{figure*}[!ht]
     213\includegraphics[width=0.8\textwidth]{images/modules_mdrepo-smc.png}
     214\caption{The component view on the SMC - modules and their inter-dependencies}
     215\label{fig:modules-mdrepo}
     216\end{figure*}
    191217
    192218
    193219\subsection{User Interface?}
    194220
     221
    195222\subsubsection*{Query Input}
     223
     224
     225\begin{figure*}[!ht]
     226\includegraphics[width=0.8\textwidth]{images/query_input_autocomplete_term.png}
     227\caption{A proposed query input interface offering concepts as search indexes}
     228\label{fig:query_input}
     229\end{figure*}
     230
     231Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions.
    196232
    197233\subsubsection*{Columns}
     
    207243\todoin{check Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf}
    208244
     245\section{SMC-Browser}
     246\label{smc-browser}
     247
     248Explore the Component Metadata Framework
     249
     250As the data set keeps growing both in numbers and in complexity, the call from the CMD community to provide advanced/enhanced ways for its exploration gets stronger. \textit{SMC browser} is one answer to this need. It is a web application that lets the metadata modeller explore the information about profiles, components, elements and the usage of DCs as an interactive graph. This makes it possible, for example, to examine the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g., counting how many elements a profile contains, or in how many profiles a DC is used.
     251
     252In CMD, metadata schemas are defined by profiles, that are constructed out of reusable components - collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles. Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should be interpreted \cite{Broeder+2010}.
     253
     254Thus, every profile can be expressed as a tree, with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, the parent-child relationship being defined by inclusion (\code{componentA -includes-> componentB}) or referencing (\code{elementA -refersTo-> datcat1}). The reuse of components in multiple profiles and especially the referencing of the same data categories in multiple CMD elements leads to a blending of the individual profile trees into a graph (directed and acyclic, but not necessarily connected).
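The blending of profile trees into a graph can be illustrated with a small sketch. It is a toy model under stated assumptions, not the data structure used by SMC browser: the edge list, the node names beyond those in the paragraph above (\texttt{ProfileA}, \texttt{ProfileB}, \texttt{elementC}) and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical edges (parent, child), covering both inclusion and referencing.
# componentB is reused by two profiles, datcat1 is referenced by two elements.
EDGES = [
    ("ProfileA", "componentA"),
    ("componentA", "componentB"),
    ("componentB", "elementA"),
    ("elementA", "datcat1"),
    ("ProfileB", "componentB"),
    ("ProfileB", "elementC"),
    ("elementC", "datcat1"),
]


def build_graph(edges):
    """Return child and parent adjacency maps of the directed graph."""
    children, parents = defaultdict(set), defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
        parents[child].add(parent)
    return children, parents


def reused_nodes(parents):
    """Nodes with more than one parent, i.e. where the trees blend."""
    return sorted(node for node, ps in parents.items() if len(ps) > 1)


def profiles_using(node, parents):
    """Walk upwards from a node and collect the root nodes (profiles)."""
    roots, stack, seen = set(), [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if not parents.get(current):
            roots.add(current)  # no parent: this is a profile root
        stack.extend(parents.get(current, ()))
    return sorted(roots)


children, parents = build_graph(EDGES)
print(reused_nodes(parents))
print(profiles_using("datcat1", parents))
```

Counting the roots reachable from a data category corresponds to the statistic ``in how many profiles a DC is used'' mentioned above.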
     255
    209256
    210257\section{Summary}
  • SMC4LRT/chapters/Infrastructure.tex

    r3234 r3551  
    55\section{CLARIN / CMDI}
    66\label{def:CLARIN}
     7\label{def:CMDI}
    78CLARIN - Common Language Resource and Technology Infrastructure - constituted by over 180 members from around 38 countries. The mission of this project is to
    89
     
    1516
    1617
    17 As stated before, the SMC is part of CMDI and depends on multiple modules of the infrastructure. Before we describe the interaction itself in chapter \ref{method}, we introduce in short these modules and the data they provide:
     18As stated before, the SMC is part of CMDI and depends on multiple modules of the infrastructure. Before we describe the interaction itself in chapter \ref{ch:design}, we briefly introduce these modules and the data they provide:
    1819
    1920\begin{itemize}
     
    2930?MDService
    3031
    31 
    32 \begin{figure*}[!ht]
    33 \includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2}
    34 \caption{The diagram depicts the links between pieces of data in the individual registries that serve as basis for semantic mapping}
    35 \end{figure*}
     32\begin{figure*}[!ht]
     33\includegraphics[width=0.8\textwidth]{images/CMDI_components_old.png}
     34\caption{The diagram (from early CLARIN/CMDI presentations) shows individual modules of the CMDI and their interrelations}
     35\end{figure*}
     36
    3637
    3738\subsection{CMDI - DCR/CR/RR}
    38 \label{def:cmdi}
    39 \label{def:dcr}
     39\label{def:CMD}
     40\label{def:DCR}
    4041
    4142The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories. The resulting commonly agreed controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework.
     
    4647% \emph{Component Registry} implements the Component Data Model and allows to define, maintain and publish CMD-components and -profiles.
    4748
     49
     50\begin{figure*}[!ht]
     51\includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2}
     52\caption{The diagram depicts the links between pieces of data in the individual registries that serve as basis for semantic mapping}
     53\end{figure*}
     54       
    4855The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions.
    4956However there needs to be an additional means to capture information about relations between data categories.
     
    6976from the traditional methods of schema matching that try to establish pairwise alignments between schemas only after they were created and published. % -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.
    7077
    71 Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules is described in the following chapter \ref{method}.
     78Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules are described in the following chapter \ref{ch:design}.
    7279
    7380\subsection{Vocabulary Service / Reference Data Registry}
     
    93100
    94101\subsubsection{Vocabulary Service - CLAVAS}
    95 As described in previous section (\ref{dcr}), a solid pilar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is – by design – not usable for all kinds of reference data. In general, it suits well for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow does not lend itself well to maintain “semi-closed'' concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added).
     102\label{def:CLAVAS}
     103As described in the previous section (\ref{def:DCR}), a solid pillar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is -- by design -- not usable for all kinds of reference data. In general, it is well suited for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow do not lend themselves well to maintaining ``semi-closed'' concept domains, i.e. controlled vocabularies such as lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not every string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed to be changing (especially with new entities being added).
    96104
    97105This shortcoming leads to a need for an additional registry/repository service for this kind of data (controlled vocabularies). Within the CLARIN project mainly the abovementioned taskforce \emph{CLAVAS} is concerned with this challenge.
     
    103111Currently, the Meertens Institute\furl{http://meertens.knaw.nl/} of the Dutch Royal Academy of Sciences (KNAW), as well as Netherlands Institute for Sound and Vision\furl{http://www.beeldengeluid.nl/} are running an instance of OpenSKOS.
    104112As the work on this vocabulary repository started in the context of a cultural heritage program, originally it served vocabularies not directly relevant for the LRT community, such as \emph{GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven} or \emph{AAT - Art \& Architecture Thesaurus}\furl{http://openskos.org/api/collections}. As part of the process of adaptation to the needs of CLARIN and the LRT community, data categories from \xne{ISOcat} have been converted into SKOS format and ingested into the system.
    105 \xne{CLARIN Centre Vienna} is also running a prototypical instance of the OpenSKOS system with ISOcat data.
     113\xne{Austrian Centre for Digital Humanities} is also running a prototypical instance of the OpenSKOS system with ISOcat data.
    106114
    107115A plan has been developed/adopted to support further vocabularies relevant for the community.
     
    114122
    115123See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies
    116 and \ref{interaction-dcr-skos} for discussion on mapping the information about data categories from ISOcat to \xne{SKOS}.
     124and \ref{interaction-dcr-skos} for discussion on mapping the information about data categories from \xne{ISOcat} to \xne{SKOS}.
    117125
    118126\subsection{Interaction between DCR, VAS and client applications}
     
    286294With this proposal, ISOcat constrained DCs can refer to a CLAVAS vocabulary as a way to constrain (we stretch this a bit if a vocabulary is 'open', e.g., like organization names where it provides the preferred spelling of known organizations but still has to be possible to add new organization names, not in the vocabulary).
    287295
    288  In ISOcat such constraints have the same status as, for example, the data type, which is that ISOcat just provides hints it has no way to enforce this. Look at CMDI where the CMDI elements refer to a ISOcat DC via a concept link but they may have a completely different data type. In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (well its in the planning).
     296 In ISOcat, such constraints have the same status as, for example, the data type, which is that ISOcat just provides hints it has no way to enforce this. Look at CMDI where the CMDI elements refer to a ISOcat DC via a concept link but they may have a completely different data type. In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (well its in the planning).
    289297
    290298\begin{note}
     
    306314It can use the reference to the DC to fetch explanations (semantic information)  (and translations) from ISOcat, but it is bound to the value range as restricted by the schema.
    307315
    308 \todoask{ Could the application use the the vocabulary indication in DC-spec as default or fallback?}
    309 
    310 
    311 
    312        
    313316\subsection{CMDI - Exploitation side}
    314317Metadata complying with the CMD framework is being created by a growing number of institutions by various means: automatic transformation from legacy data or authoring of new metadata records with the help of one of the metadata editors (TODO: cite: Arbil, NALIDA, ). The CMD Infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints. These are harvested daily by a dedicated CLARIN harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}. The harvested data is validated against the schemas \todoin{What about Normalization?} and made available in packaged datasets. These are fetched by the exploitation-side components, which index the metadata records and make them available for searching and browsing.
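To make the harvesting step more tangible, the following sketch builds the request URLs a harvester would send to a provider's OAI-PMH endpoint. It only illustrates the protocol's \texttt{ListRecords} verb and is not the CLARIN harvester itself: the endpoint URL, the metadata prefix \texttt{cmdi} and the function name are assumptions made for the example.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when continuing
    a partial list it must not be combined with metadataPrefix.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and prefix, for illustration only.
ENDPOINT = "https://example.org/oai"
print(list_records_url(ENDPOINT, metadata_prefix="cmdi"))
print(list_records_url(ENDPOINT, resumption_token="token-1"))
```

A harvester would repeat the second kind of request for as long as the response contains a non-empty resumption token.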
     
    328331and \emph{Metadata Service} that provides search access to this body of data. As such, Metadata Service is the primary application to use Semantic Mapping, to optionally expand user queries before issuing a search in the Metadata Repository. \cite{Durco2011}
    329332
    330 
    331333\section{Content Repositories}
    332334Metadata is only one aspect of the availability of resources. It is the first step to announce and describe the resources. However it is of little value, if the resources themselves are not equally well accessible. Thus another pillar of the CLARIN infrastructure are Content Repositories - centres to ensure availability of resources.
     
    339341\section{Distributed system - federated search}
    340342
    341 Metadata -> harvesting via OAI-PMH
    342 but Content search has to be really distributed.
    343 
    344 ?
     343Metadata -> harvesting via OAI-PMH, but Content search has to be really distributed.
     344
    345345\begin{description}
    346346\item[Z39.50/SRU/SRW/CQL] LoC
     
    348348\end{description}
    349349
     350
    350351\section{Summary}
  • SMC4LRT/chapters/Introduction.tex

    r3234 r3551  
    66\section{Motivation / problem statement}
    77
    8 While in the Digital Libraries community a consolidation generally already happened and global federated networks of digital library repositories are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardization and integration efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types combined with project-specific needs. (chapter \ref{ch:data} analyses the disparity in the data domain)
     8While in the Digital Libraries community a consolidation has already taken place and global federated networks of digital library repositories have been set up, in the field of Language Resources and Technology the landscape is still scattered, despite a decade of standardization and integration efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types combined with project-specific needs. (Chapter \ref{ch:data} analyses the disparity in the data domain.)
    99
    10 This situation has been identified by the community and multiple standardization initiatives had been conducted/undertaken. The process has gained a new momentum thanks to large research infrastructure programmes introduced by the European Commission, aimed at fostering the development of common large-scale international infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies more easily available to scholars, by providing a common harmonized architecture for accessing and working with LRT. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:cmdi})
    11 -- a distributed system consisting of multiple interconnected applications aimed at creating and providing metadata for lLRT in a coherent harmonized way.
     10This situation has been identified by the community and numerous standardization initiatives have been undertaken. The process has gained new momentum thanks to large framework programmes introduced by the European Commission, aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies more easily available to scholars by providing a common harmonized architecture for accessing and working with Language Resources and Technology (LRT). One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way.
    1211
    13 This work discusses a module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} -- dedicated to overcome or at least ease the semantic interoperability problem stemming from the heterogenity of the resource descriptions, without the reductionist approach of trying to impose one common description schema for all resources.
     12This work discusses one module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} -- dedicated to overcoming or at least easing the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of imposing one common description schema for all resources.
    1413
    1514\section{Main Goal}
    1615
    17 The primary goal of this work is to \emph{\textbf{enhance search functionality} over a large heterogeneous collection of resource descriptions} in the field of Language Resources and Technology (LRT), henceforth referred to as \emph{semantic search} , distincting it from the necessary underlying processing, referred to as \emph{semantic mapping}.
     16The primary goal of this work is to \emph{\textbf{enhance search functionality} over a large heterogeneous collection of resource descriptions} in the field of LRT, henceforth referred to as \xne{semantic search}, distinguishing it from the necessary underlying preprocessing, referred to as \xne{semantic mapping}.
    1817
    1918The -- notoriously polysemic -- term ``mapping'' can have three different meanings within this work,
     
    2625\end{description}
    2726
    28 The work can further be divided along the schema / instance duality/dimension. Figure \ref{fig:master_outline} sketches the goals / conceptual space of this thesis.
     27The work can further be divided along the schema -- instance duality. Figure \ref{fig:master_outline} spans the conceptual space of this work and depicts the relations between individual subgoals.
    2928
    30 %\includegraphics[width=\unitlength]{images/master_outline.eps}
     29\begin{figure*}[!ht]
     30\begin{center}
     31%\includegraphics[width=1\textwidth]{images/SMC-export_language_custom_v2c.pdf}
     32\includegraphics{images/master_outline.png}
     33\end{center}
     34\caption{The conceptual space of this work}
    3135\label{fig:master_outline}
    32 \input{images/master_outline.eps_tex}
     36\end{figure*}
     37%\input{images/master_outline.eps_tex}
    3338
    34 \subsubsection*{Crosswalks}
    35 Goal is not primarily to produce the crosswalks but rather to develop the service serving them.
     39\subsubsection*{Crosswalk service}
     40Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were woven into the underlying meta-model as well as all the modules of the infrastructure. The task of the crosswalk service -- the primary part of the SMC module -- is to collect the relevant information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata schemas that can serve as a basis for concept-based search.
    3641
    37 ???
    38 
    39 While this may seem a rather trivial task, it is not if we consider the heterogeneity and complexity of the dataset,
    40 further complicated by the fact, that this shall be community-driven process, without a central authority defining the relations
    41 and that there may be even need for different relation sets for different tasks. In fact, a number of modules of the discussed infrastructure are dedicated to overcoming the semantic interoperability problem.
     42Thus, the goal is not primarily to produce the crosswalks but rather to develop the service serving existing ones.
    4243
    4344\subsubsection*{Concept-based query expansion}
    4445
    45 Once the crosswalks are available, they can be used to expand/translate user queries, to match related fields across heterogeneous metadata formats, resulting in higher recall.
     46Once the crosswalks are available, they can be used to rewrite user queries (or to generate appropriate search indexes), so that they match related fields across heterogeneous metadata schemas, resulting in higher recall when searching.
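
The query rewriting described above can be illustrated with a minimal sketch. It assumes the crosswalks are already available as concept clusters, i.e. sets of fields linked to the same concept. The cluster content, the field names other than dublincore:title, and the function names are hypothetical and serve only as illustration; they are not the actual interface of the crosswalk service.

```python
from typing import Dict, List, Optional, Set

# Hypothetical concept clusters: each concept maps to the fields linked to it.
# Only "dublincore:title" is taken from the text; the other names are made up.
CONCEPT_CLUSTERS: Dict[str, Set[str]] = {
    "title": {"dublincore:title", "profileA.resourceName", "profileB.resourceTitle"},
}


def concept_of(field: str, clusters: Dict[str, Set[str]]) -> Optional[str]:
    """Return the concept whose cluster contains the field, if any."""
    for concept, fields in clusters.items():
        if field in fields:
            return concept
    return None


def expand_query(field: str, term: str, clusters: Dict[str, Set[str]]) -> List[str]:
    """Rewrite a single field/term clause into one clause per related field.

    Fields that are not part of any cluster are passed through unchanged,
    so the expansion never drops the user's original clause.
    """
    concept = concept_of(field, clusters)
    fields = clusters[concept] if concept is not None else {field}
    return [f'{f}="{term}"' for f in sorted(fields)]


if __name__ == "__main__":
    # The resulting clauses would be combined with OR by the search engine.
    for clause in expand_query("dublincore:title", "corpus", CONCEPT_CLUSTERS):
        print(clause)
```

Because the lookup goes through the concept rather than through the field label, fields that merely share a substring with the queried field are not matched unless they are explicitly part of the same cluster.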
    4647
    4748\paragraph{Example}
    48 Confronted with a user query searching in the notorious \concept{dublincore:title} the query has to be expanded to
    49 all the semantically near fields (concept cluster), that are however labelled (or even structured) differently in other formats like
     49Confronted with a user query searching in the notorious \concept{dublincore:title} the query has to be \emph{expanded} to
     50all the semantically near fields (\emph{concept cluster}), that are however labelled (or even structured) differently in other schemas like:
    5051
    5152\begin{quote}
     
    5354\end{quote}
    5455
    55 but probably not to other fields, using same (sub)strings for the field labels
    56 but with different semantics, like:
     56while other fields, labeled with the same (sub)strings but with different semantics, should not be considered:
    5757
    5858\begin{quote}
     
    6262\subsubsection*{Semantic interpretation}
    6363
    64 The problem of different labels for semantically similar or even identical things is even more so virulent on the level of individual values in the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type})  have a constrained value domain that yet cannot be explicitly/exhaustively enumerated. This leads to inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to map (string) values in selected fields to entities defined in corresponding vocabularies.
     64The problem of different labels for semantically similar or even identical entities is even more virulent on the level of individual values in the fields of the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to propose a mechanism to map (string) values in selected fields to entities defined in corresponding vocabularies.
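
A minimal sketch of such a value mapping is given below. It assumes a vocabulary that lists known label variants per entity and matches values after a simple normalization (lowercasing, removing punctuation, collapsing whitespace). The vocabulary content, the entity identifier and the function names are hypothetical; a real mechanism would likely need fuzzier matching and manual curation.

```python
import re
from typing import Dict, List, Optional

# Hypothetical vocabulary: entity identifier -> known label variants.
VOCABULARY: Dict[str, List[str]] = {
    "org:example-institute": [
        "Example Institute",
        "Example Inst.",
        "EXAMPLE INSTITUTE (EI)",
    ],
}


def normalize(label: str) -> str:
    """Lowercase, replace punctuation by spaces and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", label.lower())
    return " ".join(cleaned.split())


def build_index(vocabulary: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every normalized label variant to its entity identifier."""
    return {
        normalize(label): entity_id
        for entity_id, labels in vocabulary.items()
        for label in labels
    }


def map_value(value: str, index: Dict[str, str]) -> Optional[str]:
    """Return the entity for a raw field value, or None if it is unknown."""
    return index.get(normalize(value))


if __name__ == "__main__":
    index = build_index(VOCABULARY)
    print(map_value("  example   institute ", index))
    print(map_value("Some unknown organization", index))
```

Values that cannot be mapped are returned as None rather than guessed, so that they can be collected and handed over to curation.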
    6565
    66 \subsubsection*{Ontology-driven search / data exploration}
     66\subsubsection*{Ontology-driven data exploration}
    6767
    68 By applying semantic web technologies, the user will be given new means of \emph{exploring the dataset} through semantic resources (ontology-driven search/browsing/exploration).
     68Based on the results of the previous parts of the work -- crosswalks and semantic interpretation -- the discussed dataset can be expressed as one big ontology. Consequently, semantic web technologies can be applied giving the user new means of \emph{exploring the dataset} through semantic resources.
    6969
    7070\paragraph{Example}
    71 Ontology-driven search: Starting from a list of topics the user can browse an ontology to find institutions concerned with those topics and retrieve a union of resources for the resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external linked-in semantic resources.
     71Ontology-driven search -- Starting from a list of topics the user can browse an ontology to find institutions concerned with those topics and retrieve a union of resources for the resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external interlinked semantic resources.
    7272
    7373\subsubsection*{Visualization}
     
    7575
    7676\section{Method}
    77 The primary concern of this work is the integrative effort, i.e. bringing together existing pieces (resources, components and methods). We start with examining the existing data and the description of the evolving infrastructure in which this work is embedded.
     77We start with examining the existing data and with the description of the existing infrastructure in which this work is embedded.
    7878
    7979Building on this groundwork, in accordance with the first subgoal, we lay out the design of the service for handling crosswalks and concept-based query expansion. We describe the workflow, the central methods and the role of the module relative to other parts of the infrastructure.
     
    103103\section{Expected Results}
    104104
    105 The main result of this work will be the \emph{specification} of the two modules \texttt{Semantic Search} and the underlying \texttt{Semantic Mapping}.
     105The main result of this work will be the \emph{specification} of the two modules \xne{concept-based search} and the underlying \xne{crosswalk service}.
    106106This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components
    107107and the results and findings of the \emph{evaluation}.
     
    110110
    111111\begin{description}
    112 \item [Specification Semantic Mapping] design of the mapping mechanism
    113 \item [Specification Semantic Search] design of the query expansion and integration with search engines
    114 \item [Prototype] proof of concept implementation
     112\item [Crosswalk service] specification and basic proof-of-concept implementation of the module
     113\item [Concept-based search] design of the query expansion and integration with search engines
     114\item [Visualization] design of an application for interactive exploration of the concerned dataset
    115115\item [Evaluation] evaluation results of querying the dataset comparing traditional search and semantic search
    116 \item [LinkedData] translation of the source dataset to RDF-based format with links into existing datasets/ontologies/knowledgebases
     116\item [LinkedData] translation of the source dataset to RDF-based format with links into existing datasets, ontologies, knowledge bases
    117117\end{description}
    118118
     
    122122In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components /modules /services of the infrastructure underlying this work.
    123123
    124 The main part of the work is found in chapters \ref{ch:design}, \ref{ch:implementation} and \ref{ch:cmd2rdf} laying out the design of the software module, the proposal  how to modell the data in RDF and the possibilities of visualization respectively.
     124The main part of the work is found in chapters \ref{ch:design} and \ref{ch:design-instance}, laying out the design of the software module and the proposal for how to model the data in RDF, respectively.
    125125
    126126The evaluation and the results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.
  • SMC4LRT/chapters/Literature.tex

    r3140 r3551  
    44%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    55
    6 This work is guided by \todoin{two (or three? + Infrastructure} main dimensions: the data - in broad, Language Resource and Technology  and the method - Semantic Web technologies. This division is reflected in the following chapter:
     6This work is guided by two main dimensions: the \textbf{data} -- broadly, Language Resources and Technology -- and the \textbf{method} -- Schema matching and Semantic Web technologies. This division is reflected in the following chapter:
    77
    88\section{(Infrastructure for) Language Resources and Technology}
     
    1414Chapter \ref{ch:data} examines the field of LRT in more detail.
    1515
     16
    1617\subsection{Metadata}
    17 A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\footnote{\url{http://www.clarin.eu/cmdi}} \cite{Broeder+2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders2009,Broeder2010}.
     18A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\footnote{\url{http://www.clarin.eu/cmdi}} \cite{Broeder2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders+2009,Broeder2010}.
    1819
    19 Individual components of this infrastructure will be described in more detail in the section \ref{ch:components}.
     20Individual components of this infrastructure will be described in more detail in the section \ref{ch:infra}.
    2021
     22A number of solutions have evolved in recent years.
     23The first to undertake standardization efforts for the exchange of catalog information were digital libraries.
     24
     25Z39.50 as base protocol, Worldcat, mapping/configuration files.
     26These catalogs are further described in the section \ref{sec:other-md-catalogs}
     27
     28In recent years, the evolving research infrastructures have all identified a common, harmonized search as a crucial component of the system and have come up with a number of solutions, however often limited to collecting metadata, reducing it to dublincore
     29and offering a Lucene/Solr-based faceted search.
     30These catalogs are further described in the section \ref{sec:lrt-md-catalogs}.
     31
     32Riley and Becker \cite{Riley2010seeing} put the overwhelming amount of existing metadata standards into a systematic comprehensive overview analyzing the use of standards from four aspects: community, domain, function, and purpose.
    2133
    2234\subsection{Content Repositories}
     
    7991\todoin{check if relevant: http://schema.org/}
    8092
     93\subsection{Existing Crosswalk services}
     94
     95\url{http://www.oclc.org/developer/services/metadata-crosswalk-service}
     96
     97http://semanticweb.org/wiki/VoID
     98http://www.dnb.de/rdf
     99
    81100\subsection{Ontology Visualization}
    82101
  • SMC4LRT/chapters/Results.tex

    r3240 r3551  
    1 \chapter{Evaluation}
    2 \label{ch:Evaluation}
    3 
    4 
    5 \section{Sample Queries}
    6 
    7 candidate Categories:
    8 ResourceType, Format
    9 Genre, Topic
    10 Project, Institution, Person, Publisher
    11 
    12 
    13 \section{Exploring Data Categories}
    14 In the ISOcat DCR 791 DCss are defined in the Metadata thematic profile, out of which 222 were created by the \textit{Athens Core} group. \todoin{need to check, how many of these athens-core data categories are being employed} In the following we describe two show cases -- \textit{Language} and \textit{name} -- in more detail.
     1\chapter{Results and Findings}
     2\label{ch:results}
     3
     4In this chapter, the results of the work are presented, divided into two main areas:
     5
     6software and data.
     7
     8In two sections, we explore the CMD data domain - the usage of the data categories on the one hand and the integration of existing formats on the other hand. While these two aspects were not directly part of this work, they a) were made possible by the output of this work (SMC-Browser, statistical analysis), b) yield a valuable test case for the usefulness of the work and c) are an indispensable prerequisite for the necessary curation work being carried out by the CMDI community.
     9
     10\section{Current status of the infrastructure}
     11Before we get to the results of this work,  we briefly summarize the current state of affairs within the CLARIN infrastructure at large to help contextualize the actual results.
     12
     13\subsection{CMDI - services}
     14The main services of the infrastructure have been in stable production for the last two years.
     15The Relation Registry is operational as an early prototype.
     16Three instances of OpenSKOS are running, one of them being hosted by ACDH.
     17
     18\subsection{CMDI - data}
     19More than 130 profiles are defined. (See \ref{table:dev_profiles} for more details about profiles.)
     20The official CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/} collects data from 69 providers on a daily basis.
     21The collection amounts to over 550.000 records in 64 profiles.
     22
     23\subsection{ACDH - the home of SMC}
     24Within CLARIN-AT a new centre has been brought to life, the Austrian Centre for Digital Humanities, that provides depositing and publishing services to the DH community, as well as infrastructural services that are part of the CLARIN Metadata Infrastructure.
     25Figure \ref{fig:acdh_context} sketches the broader context of \xne{acdh} and its different roles.
     26
     27
     28\section {Software}
     29The specification of the system can be found in the chapters \ref{ch:design} and \ref{ch:design-instance}.
     30
     31There is a prototypical implementation for three parts of the system:
     32
     33\begin{itemize}
     34\item the crosswalk service as a REST web service
     35\item a module to integrate with a search engine
     36\item a web application that allows advanced interaction with the data set
     37\end{itemize}
     38
     39The SMC module is being maintained in the CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.
     40
     41Furthermore, the CMD data has been expressed in RDF, as a first important step towards incorporating the dataset in the \emph{Web of Data}.
     42
     43\subsection{SMC - crosswalks service}
     44
     45The core function of the SMC is implemented as a set of XSL-stylesheets, with auxiliary functionality (like caching or a wrapping web service) being provided by a wrapping application implemented in Java.
     46
     47\subsection{SMC - as a module within Metadata Repository}
     48There is also an XQuery implementation that is integrated as a module of SADE/cr-xq, the eXist-based web application framework for publishing resources on which the Metadata Repository is running.
     49
     50
     51\subsection{SMC Browser -- Advanced Interactive User Interface}
     52
     53SMC Browser\furl{http://clarin.aac.ac.at/smc-browser} is a web application to explore the complex dataset of the Component Metadata Framework, by visualizing its structure as an interactive graph.
     54
     55It is implemented on top of the JavaScript library d3; the code is checked into the CLARIN SVN repository.
     56
     57The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve the (multilingual) names and descriptions of the data categories referenced in the CMD elements, the definitions of the referenced data categories are fetched from DublinCore and ISOcat.
     58
     59E.g. starting from 124 profiles, this amounts to a graph with ??? nodes and ??? edges.
     60
     61\begin{figure*}[!ht]
     62\includegraphics[width=1\textwidth]{images/screen_SMC-Browser_2013-01-23}
     63\caption{Screenshot of the SMC browser}
     64\end{figure*}
     65
     66SMC Browser also features detailed numerical statistics about the dataset as a whole and about individual items (profiles, components, data categories), a set of example results and user documentation.
     67
     68In the following section, we make extensive use of the output of this tool, to visualize individual aspects of the discussed data set.
     69
     70\subsection{SMC LOD}
     71
     72
     73\section{Exploring the usage of data categories}
     74At the core of the whole SMC (and CMDI) are the data categories as basic conceptual building blocks or anchors.
     75We want to take a closer look at the usage of the data categories in the CMD infrastructure, exemplified by a few very common concepts -- \concept{language}, \concept{name}, \concept{resource type}, \concept{???}.
     76
     77In the ISOcat DCR 791 DCs are defined in the Metadata thematic profile, out of which 222 were created by the \textit{Athens Core} group. \todoin{need to check, how many of these athens-core data categories are being employed}
    1578
    1679\subsection{Language}
     
    3699
    37100
    38 \subsection{Name}
     101\subsection{Name / Title}
    39102There are as many as 72 CMD elements with the label \texttt{Name}, referring to 12 different DCs.
    40103Again, the main DC \textit{resourceName} (\texttt{DC-2544}), used in 74 profiles, together with the semantically near \textit{resourceTitle} (\texttt{DC-2545}), used in 69 profiles, offers good coverage of the available data.
     
    46109\subsection{Subject, Genre, Topic}
    47110
    48 \section{Mapping existing Formats}
     111\section{Exploring the integration of existing formats}
     112
     113CLARIN set out with the aspiration to overcome the babylon of metadata formats
     114and its flexible CMD metamodel is specifically designed to integrate existing formats.
     115In this section, we analyze the state of integration efforts for 4 major formats: \xne{dublincore/OLAC}, \xne{teiHeader} and \xne{META-SHARE resourceInfo}.
    49116
    50117\subsection{dublincore / OLAC}
    51118
    52 Very widely used format
     119Very widely used (because) simple format
    53120\ref{info:olac-records}
    54121
    55 There are 4-5 CMD profiles modelling OLAC/dcmi-terms
    56 
     122Here the problem of proliferation seems especially virulent. Table \ref{table:dcterms-profiles} lists all the profiles modelling dcterms.
     123As all these profiles are linked to the corresponding dublincore data categories, this does not pose a major problem on the exploitation side; however, the cluttering of the component registry with structurally identical or almost identical profiles needs to be questioned within the community.
     124
     125\begin{figure*}[!ht]
     126\begin{center}
     127\includegraphics[width=0.5\textwidth]{images/dcmiterms-profiles.png}
     128\end{center}
     129\caption{The meanwhile four DCMI profiles with identical conceptual linking}
     130\label{fig:dcmi-profiles}
     131\end{figure*}
     132
     133
     134\begin{table}
     135\caption{Profiles modelling dublincore terms}
     136\label{table:dcterms-profiles}
     137  \begin{tabular}{ l | l | l | r | r }
     138    \hline
     139profile name & created & creator & count & instances \\
     140    \hline
     141component-dc-terms-modular & 2010-04-21 & CMDI-team & 15 / 15 / 15 \\
     142component-dc-terms & 2010-04-21 & CMDI-team & 0 / 15 / 15 \\
     143DcmiTerms & 2010-10-28 & Dieter Van Uytvanck & 0 / 55 / 55 & \\
     144OLAC-DcmiTerms & 2010-10-28 & Dieter Van Uytvanck & 0 / 55 / 55 & \\
     145OLAC-DcmiTerms\footnote{optional DANS-DC-metadata component} & 2013-02-12 & Menzo Windhouwer & 1 / 71 / 62 & \\
     146DC-UBU & 2013-05-29& Utrecht University Library & 0 / 15 / 15 & \\
     147OLAC-DcmiTerms-ref & 2013-06-24 & fankhauser@ids-mannheim.de & 0 / 55 / 55 & \\
     148    \hline
     149  \end{tabular}
     150\end{table}
     151
     152Additionally, there are a number of profiles with concept links to dublincore terms.
     153Some use all of the dublincore elements or terms as one component within a larger profile,
     154one example being the \xne{data} profile created by the Czech initiative LINDAT, which models the minimal obligatory set of META-SHARE \xne{resourceInfo} combined with a simple dublincore record (see also the subsection about META-SHARE below).
     155Other profiles refer only to some data categories. Most often used: \concept{Title} (used in 33 profiles) and \concept{Creator} (in 29 profiles).
     156Profiles that make more frequent use of the dublincore terms:
     157
     158\begin{itemize}
     159\item EastRepublican (8)
     160\item HZSKCorpus (17)
     161\item teiHeader (8)
     162\item ToolService (15)
     163\item OralHistoryInterviewDANS (15)
     164\end{itemize}
     165
     166\begin{figure*}[!ht]
     167\begin{center}
     168\includegraphics[width=0.8\textwidth]{images/profiles_using_dcmiterms.png}
     169\end{center}
     170\caption{Profiles referring to at least some of the dublincore data categories/terms}
     171\label{fig:profiles-using-dcmiterms}
     172\end{figure*}
    57173
    58174
     
    65181The widespread use of TEI for encoding textual resources  brings about a strong interest of multiple research teams of the CLARIN community to integrate TEI with CMDI. There was a first attempt already in 2010, modelling the recommended \xne{teiHeader}\furl{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html\#HD7}, encoding \xne{fileDesc} and \xne{profileDesc} components, leaving out \xne{encodingDesc} and \xne{revisionDesc}. The leaf elements were bound to the most prominent data categories, making it a mixture of both dublincore and isocat.
    66182
    67 The large research project \xne{Deutsches Text Archiv}\furl{http://deutschestextarchiv.de/}\todocite{DTA}, digitizing a hoist of historical german texts from the period 1650 - 1900 also uses TEI to encode the material and consequently the teiHeader to hold the metadata information.
    68 \todoin{Why a separate cmd-profile}
    69 
    70 \xne{Nederlab} is another large-scale project concerned with \todoin{dutch? historic texts}, starting 2013 in Netherlands\todocite{Nederlab}. Within this project another set of CMD profiles was created, however reusing existing components.
    71 As seen in figure \ref{fig:teiHeadeer_DBNL}, components \xne{fileDesc} and \xne{profileDesc} were reused, while the components \xne{encodingDesc} and \xne{revisionDesc}, left out in the original profile, were added.
    72 
    73 Another approach was applied within the context of other CLARIN-NL projects, \todocite{Windhouwer, 2012} generated, based on an ODD-file, a data category for every element of the teiHeader (135 datcats) creating a dedicated data category selection: \xne{TEI Header (2.1.0)}. In a subsequent step, an enriched schema was generated, that remodells the original teiHeader-schema, but with the individual elements being annotated with the new data categories (\code{dcr:datcat}-attribute). This schema is now maintained in the SCHEMAcat (cf. \ref{ch:components}. The next step would be to create again a new profile, but with all the components and elements in it bound to the corresponding data categories, moving the semantic linking into the relation registry, where appropriate relations could be defined between the data categories derived from TEI and the \xne{isocat} and/or \xne{dublincore} DCs.
     183The large research project \xne{Deutsches Textarchiv}\furl{http://deutschestextarchiv.de/}\cite{Geyken2011deutsches}, digitizing a host of historical German texts from the period 1650 - 1900, also uses TEI to encode the material and consequently the teiHeader to hold the metadata information. Part of the project is also to integrate the data and metadata with the CLARIN infrastructure, meaning CMD records need to be generated for the resources. For this the team generated a completely new profile (as yet private) closely modelling the version of the teiHeader\furl{http://www.deutschestextarchiv.de/doku/basisformat_header} used in the project.
     184Regarding the question, why another teiHeader-based profile was generated not reusing the existing one, according to a personal note by a member of the project team and author of the profile, Axel Herold\cite{Herold2013} the profile was custom made for this particular project and it seemed undesirable to create a generalised TEI header profile.
     185
     186\xne{Nederlab} is another large-scale project, started in 2013 in the Netherlands\furl{http://www.nederlab.nl}, aiming at processing historic Dutch newspaper articles into a platform for search and analysis. Within this project, the metadata is also encoded in a \concept{teiHeader} and the data shall be integrated within CLARIN. Here, another set of CMD profiles was created, however reusing existing components.
     187As seen in figure \ref{fig:teiHeader_DBNL}, components \xne{fileDesc} and \xne{profileDesc} were reused, while the components \xne{encodingDesc} and \xne{revisionDesc}, left out in the original profile, were added.
     188
     189Another approach was applied within the context of other CLARIN-NL projects\cite{Menzo2013-05tei}. Based on an ODD-file, a data category for every element of the teiHeader (135 datcats) was generated. In a subsequent step, an enriched schema was generated that remodels the original teiHeader-schema, but with the individual elements being annotated with the new data categories (\code{dcr:datcat}-attribute). This schema is now maintained in the SCHEMAcat (cf. \ref{ch:infra}). The next step would be to create again a new profile, but with all the components and elements in it bound to the corresponding data categories, moving the semantic linking into the relation registry, where appropriate relations could be defined between the data categories derived from TEI and the \xne{isocat} and/or \xne{dublincore} DCs.
    74190This yields a more complex, but also a more systematic and flexible setup, with a clean separation/boundary/interface of the semantic space of TEI and the possibility to map the TEI elements (via their data categories) to multiple and/or different data categories according to the specific needs of a project or research question.
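The setup described above can be illustrated with a minimal sketch. It assumes that every teiHeader element is bound to exactly one TEI-derived data category and that each project keeps its own relation set in the relation registry. All identifiers (\code{tei:title}, \code{dc:creator}, \code{project\_a}, the function names) are hypothetical placeholders, not actual data category PIDs or registry contents.

```python
# Hypothetical sketch: every identifier below is an illustrative placeholder,
# not a real data category PID or actual relation registry content.

# Element path in the teiHeader schema -> data category derived from TEI
ELEMENT_TO_DATCAT = {
    "teiHeader.fileDesc.titleStmt.title": "tei:title",
    "teiHeader.fileDesc.titleStmt.author": "tei:author",
    "teiHeader.profileDesc.langUsage.language": "tei:language",
}

# Relation sets as a relation registry might hold them: each project
# chooses its own equivalences between data categories.
RELATION_SETS = {
    "project_a": {("tei:title", "dc:title"), ("tei:author", "dc:creator")},
    "project_b": {("tei:title", "dc:title"), ("tei:language", "dc:language")},
}


def equivalents(datcat, relation_set):
    """Return datcat plus every data category directly related to it."""
    related = {datcat}
    for left, right in RELATION_SETS[relation_set]:
        if left == datcat:
            related.add(right)
        elif right == datcat:
            related.add(left)
    return related


def elements_for(datcat, relation_set):
    """Return the element paths that match datcat under the given relation set."""
    targets = equivalents(datcat, relation_set)
    return sorted(path for path, dc in ELEMENT_TO_DATCAT.items() if dc in targets)
```

With these assumptions, the same query category resolves to different elements depending on the relation set chosen: \code{elements\_for("dc:creator", "project\_a")} finds the author element, while the same call with \code{"project\_b"} finds nothing, because that project defined no such relation.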
    75191
     
    87203  \begin{tabular}{ l | r | l | r | r | r}
    88204    \hline
    89 project, author & created & profile name & comp elem datcats & instances \\
    90     \hline
    91 Deutsches Text Archiv & 2012 & teiHeader & 56/82/10 & 857 \\
    92 ICLTT, Durco & 2010 & teiHeader & 16/35/13 & 467 \\
    93 Leipzig Corpora, Eckart & 2012 & TEIDocumentDescription & 16/35/13 & ? \\
    94 Nederlab, Zhang & 2013 & DBNL\_Tekst & 20/38,15 & ? \\
    95   & & DBNL\_Tekst\_Onzelfstandig & 20/47/21 & ? \\
     205profile name & created & creator & count & instances \\
     206    \hline
     207teiHeader & 2010 & ICLTT, Durco & 16/35/13 & 467 \\
     208teiHeader & 2012 & Deutsches Text Archiv & 56/82/10 & 857 \\
     209TEIDocumentDescription & 2012 & Leipzig Corpora, Eckart & 16/35/13 & ? \\
     210DBNL\_Tekst & 2013 & Nederlab, Zhang & 20/38/15 & \textgreater 37 million\footnote{There shall be a metadata record for every article.} \\
     211DBNL\_Tekst\_Onzelfstandig  & & & 20/47/21 & \\
    96212    \hline
    97213  \end{tabular}
    98214\end{table}
    99215
    100 \todoin{DBNL\_Tekst\_Onzelfstandig - how many instances?}
    101 
    102216DBNL\_Tekst clarin.eu:cr1:p\_1361876010678,
    103217clarin.eu:cr1:p 1366279029218 (private)
     
    108222META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components.
    109223%In cooperation between metadata teams from CLARIN and META-SHARE
    110 The model has been expressed as 4 CMD profiles for distinct resource types sharing most of the components. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 419 components and 1587 elements (when expanded). Although most of the elements are optional
    111 
    112 resourceInfo    419     1587    72      790     797     50.22 %
    113 \todoin{how many distinct components/elements}
    114 This? shows nicely the trade-off between the two different approaches between CMD and META-SHARE: many custom schemas or one very large.
    115 
    116 In a parallel effort, LINDAT, the czech national infrastructure initiative with ties to both CLARIN and META-SHARE, created a CMD profile modelling the minimal obligatory set of META-SHARE. combined with dublincore.
    117 So the information is partly duplicated, but with the advantage, that a minimal information is conveyed in the widely understood format, retaining the expressivity of the feature-rich schema
    118 
    119 resourceInfo    65      92      21      82      10      10.87 %
    120 
     224
     225\begin{figure*}[!ht]
     226\begin{center}
     227\includegraphics[width=0.5\textwidth]{images/SMC-resourceInfo.png}
     228\end{center}
     229\caption{profile by LINDAT combining META-SHARE \xne{resourceInfo} component with dublincore elements }
     230\label{fig:resource_info_5}
     231\end{figure*}
     232
     233\begin{table}
     234\caption{Profiles modelling resourceInfo}
     235\label{table:resourceinfo-profiles}
     236  \begin{tabular}{ l | l | l | r | r }
     237    \hline
     238profile name & created & creator & count & instances \\
     239    \hline
     240resourceInfo (minimal) & 2013-02-13 & LINDAT.CZ & 34 / 41 / 21 \\
     241resourceInfo (lexical) & 2013-06-02 & P. Labropoulou & 86 / 226 / 57 \\
     242resourceInfo (tools) & 2013-06-02 & P. Labropoulou & 61 / 176 / 52 \\
     243resourceInfo (language) & 2013-06-02 & P. Labropoulou & 89 / 228 / 54 \\
     244resourceInfo (corpus) & 2013-06-02 & P. Labropoulou & 117 / 337 / 72 \\
     245    \hline
     246  \end{tabular}
     247\end{table}
     248
     249The model has been expressed as 4 CMD profiles, each for a distinct resource type, with all four sharing most of their components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), so a specific instance will never use all the possible elements.
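The difference between the distinct and the expanded counts arises because a reused component is counted once in the former but once per occurrence in the latter. The following sketch illustrates this on a toy profile. The component names and numbers are invented for illustration and are not taken from the \xne{resourceInfo} profiles; counting the profile root as a component is likewise an assumption.

```python
# Hypothetical toy profile: name -> (child components, number of own elements).
# Names and numbers are invented; they do not describe any real CMD profile.
COMPONENTS = {
    "profile": (["contact", "resource"], 1),
    "resource": (["contact", "license"], 2),
    "contact": ([], 3),
    "license": ([], 1),
}


def distinct_counts(root):
    """Count each reachable component (and its own elements) exactly once."""
    seen = set()
    stack = [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(COMPONENTS[name][0])
    return len(seen), sum(COMPONENTS[name][1] for name in seen)


def expanded_counts(root):
    """Count every occurrence in the fully expanded tree (reuse counted repeatedly)."""
    children, own_elements = COMPONENTS[root]
    components, elements = 1, own_elements
    for child in children:
        child_components, child_elements = expanded_counts(child)
        components += child_components
        elements += child_elements
    return components, elements
```

In this toy profile the component \code{contact} is used twice, so the expanded tree has one more component and three more elements than the set of distinct definitions.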
     250
     251In a parallel effort, LINDAT, the Czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\xne{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE (\xne{resourceInfo}), combined, however, with a simple dublincore record.
     252This way, the information is partly duplicated, but with the advantage that a minimal set of information is conveyed in a widely understood format, while retaining the expressivity of the feature-rich schema.
    121253
    122254\begin{figure*}[!ht]
     
    137269
    138270
    139 \section{Summary}
    140 
    141 
    142 
    143 \chapter{Results}
    144 \label{ch:results}
    145 
    146 
    147 \section { Software module}
    148 
    149 The core function of the SMC is implemented as a set of XSL-stylesheets, with auxiliary functionality (like caching or a wrapping web service) being provided by a wrapping application implemented in Java. There is also a plan to provide an XQuery implementation. The SMC module is being maintained in the CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.
    150 
    151 
    152 \subsection{SMC Browser -- Advanced Interactive User Interface}
    153 
    154 Explore the Component Metadata Framework
    155 
    156 In CMD, metadata schemas are defined by profiles, that are constructed out of reusable components - collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles. Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should be interpreted (Broeder et al., 2010).
    157 
    158 Thus, every profile can be expressed as a tree, with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, parent-child relationship being defined by the inclusion (componentA -includes-> componentB) or referencing (elementA -refersTo-> datcat1).The reuse of components in multiple profiles and especially also the referencing of the same data categories in multiple CMD elements leads to a blending of the individual profile trees into a graph (acyclic directed, but not necessarily connected).
    159 
    160 SMC Browser visualizes this graph structure in an interactive fashion. You can have a look at the examples for inspiration.
    161 
    162 It is implemented on top of wonderful js-library d3, the code checked in clarin-svn (and needs refactoring). More technical documentation follows soon.
    163 
    164 The graph is constructed from all profiles defined in the Component Registry. To resolve name and description of data categories referenced in the CMD elements definitions of all (public) data categories from DublinCore and ISOcat (from the Metadata Profile [RDF] - retrieving takes some time!) are fetched. However only data categories used in CMD will get part of the graph. Here is a quantitative summary of the dataset.
     271
     272\section{Evaluation}
     273\label{evaluation}
     274
     275Sample Queries:
     276
     277candidate Categories:
     278ResourceType, Format
     279Genre, Topic
     280Project, Institution, Person, Publisher
     281
     282
     283
     284\subsection{Use Cases}
     285
     286\begin{itemize}
     287
     288\item MD Search employing Semantic Mapping
     289\item MD Search employing Fuzzy Search
     290\end{itemize}
    165291
    166292
     
    173299\section{Summary}
    174300
    175 
    176 \begin{figure*}[!ht]
    177 \includegraphics[width=1\textwidth]{images/screen_SMC-Browser_2013-01-23}
    178 \caption{Screenshot of the SMC browser}
    179 \end{figure*}
    180 
    181 
     301The direct comparison of the CMD approach, a metamodel that allows custom profiles with shared semantics to be generated, with the more traditional approach of META-SHARE, which tries to create one schema to fit all, nicely shows the trade-off: many custom schemas or one very large one.
     302
  • SMC4LRT/chapters/appendix.tex

    r3240 r3551  
    99\includegraphics[width=1\textwidth]{images/DCR_data_model.jpg}
    1010\end{center}
    11 \caption{DCR data model}
     11\caption{DCIF -- the data model for the Data Category Registry as defined by the ISO Standard ISO12620:2009 \cite{ISO12620:2009}}
    1212\label{fig:DCR_data_model}
    1313\end{figure*}
    14 \todocite{DCR data model}
    1514
    1615\begin{figure*}[!ht]
     
    2120\label{fig:ref_arch}
    2221\end{figure*}
     22
     23\begin{figure*}[!ht]
     24\begin{center}
     25\includegraphics[width=1\textwidth]{images/acdh-diagram_300dpi_rotated.png}
     26\end{center}
     27\caption{Austrian Centre for Digital Humanities - the home of SMC - in context}
     28\label{fig:acdh_context}
     29\end{figure*}
     30
     31\section {SMC Reports}
     32\label{sec:reports}
     33
     34SMC Reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain based on the visual and numerical output from the SMC Browser (cf. \ref{smc-browser}).
     35
     36
     37\input{chapters/examples_cleaned}