\chapter{Underlying infrastructure}\label{ch:components}

As stated before, the proposed module is part of CMDI and depends on several other modules of the infrastructure. Before we describe the interaction itself in chapter \ref{method}, we briefly introduce these modules and the data they provide:

\begin{itemize}
\item Data Category Registry
\item Relation Registry
\item Component Registry
\item Vocabulary Alignment Service (OpenSKOS)
\item SchemaParser
\end{itemize}

?MDBrowser
?MDService

\begin{figure*}[!ht]
\includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2}
\caption{The diagram depicts the links between pieces of data in the individual registries that serve as the basis for semantic mapping}
\end{figure*}

\subsection{CMDI - Production side}

The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories. The resulting commonly agreed controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework.
The data model and the procedures of the DCR are defined by the ISO standard \cite{ISO12620:2009}, and the registry is implemented in \emph{ISOcat}\footnote{\url{http://www.isocat.org/}}.
%Next to a web interface for users to browse and manage the data categories, DCR provides a REST-style webservice allowing applications to access the information (provided in Data Category Interchange Format - DCIF). The data categories are assigned a persistent identifier, making them globally and permanently referenceable.

The \emph{Component Metadata Framework} (CMD) is built on top of the DCR and complements it. While the DCR defines the atomic concepts, within CMD the metadata schemas are constructed out of reusable components, i.e. collections of metadata fields. Components can contain other components, and they can be reused in multiple profiles as long as each field ``refers via a PID to exactly one data category in the ISO DCR, thus indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}. This makes it trivial to infer equivalences between metadata fields in different CMD-based schemas: two fields that refer to the same data category can be treated as equivalent. While the primary registry used in CMD is the ISOcat DCR, other authoritative sources for data categories (``trusted registries'') are accepted, especially the Dublin Core Metadata Initiative \cite{DCMI:2005}.
% \emph{Component Registry} implements the Component Data Model and allows to define, maintain and publish CMD-components and -profiles.

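The inference of equivalences from shared data categories can be illustrated with a minimal Python sketch. It is not part of the CMDI code base; the profile names, field paths and identifiers are invented for illustration and do not correspond to entries of a real registry. The sketch groups fields by the data category they refer to and reports the categories shared by more than one field.

```python
from collections import defaultdict

# Hypothetical (profile, field path) -> data category PID assignments.
# Profile names, paths and PIDs are illustrative only.
FIELD_TO_DATCAT = {
    ("ProfileA", "Session/Language/Name"): "DC-0001",
    ("ProfileB", "Corpus/Lang"): "DC-0001",
    ("ProfileA", "Session/Title"): "DC-0002",
    ("ProfileB", "Corpus/Name"): "DC-0003",
}


def equivalent_fields(field_to_datcat):
    """Group fields by data category; keep only categories shared by several fields."""
    groups = defaultdict(list)
    for field, datcat in field_to_datcat.items():
        groups[datcat].append(field)
    return {dc: sorted(fields) for dc, fields in groups.items() if len(fields) > 1}


if __name__ == "__main__":
    for datcat, fields in equivalent_fields(FIELD_TO_DATCAT).items():
        print(datcat, fields)
```

In this toy data only the two language fields share a data category, so they are the only pair reported as equivalent.
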
The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions.
However, an additional means is needed to capture information about relations between data categories.
This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement infeasible. CMDI proposes a separate module, the \emph{Relation Registry} (RR) \cite{Kemps-Snijders+2008}, where arbitrary relations between data categories can be stored and maintained. We expect that the RR should be under the control of the metadata user, whereas the DCR is under the control of the metadata modeler.
% These relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.

A prototypical implementation of such a relation registry, called \emph{RELcat}, is being developed at MPI Nijmegen \cite{Windhouwer2011,SchuurmanWindhouwer2011} and already hosts a few relation sets. It has no user interface yet, but it is accessible as a REST web service\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
This implementation stores the individual relations as RDF triples
\begin{eqnarray*}
\langle \mathit{subjectDatacat},\ \mathit{relationPredicate},\ \mathit{objectDatacat} \rangle
\end{eqnarray*}
allowing typed relations such as equivalence (\texttt{rel:sameAs}) and subsumption (\texttt{rel:subClassOf}). The relations are grouped into relation sets that can be used independently.

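How such triples might be used can be sketched in a few lines of Python. The sketch works on an in-memory list of triples and does not reproduce the RELcat service or its response format. The relation set name and the identifiers are invented, and treating \texttt{rel:sameAs} as symmetric and transitive is an assumption of the sketch, not a statement about RELcat.

```python
# Hypothetical relation sets: name -> list of (subject, predicate, object) triples.
RELATION_SETS = {
    "example-set": [
        ("DC-0001", "rel:sameAs", "dc:language"),
        ("dc:language", "rel:sameAs", "DC-0004"),
        ("DC-0005", "rel:subClassOf", "DC-0001"),
    ],
}


def same_as_closure(datcat, triples):
    """Return all data categories reachable from `datcat` via rel:sameAs.

    Assumption of this sketch: rel:sameAs is symmetric and transitive.
    Other predicates (e.g. rel:subClassOf) are ignored.
    """
    seen = {datcat}
    frontier = [datcat]
    while frontier:
        current = frontier.pop()
        for subj, pred, obj in triples:
            if pred != "rel:sameAs":
                continue
            # Follow the relation in both directions.
            for a, b in ((subj, obj), (obj, subj)):
                if a == current and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return seen


if __name__ == "__main__":
    print(sorted(same_as_closure("DC-0001", RELATION_SETS["example-set"])))
```

Because the relation sets are kept separate, a user can choose which set to pass to the function and thereby decide which equivalences apply in a given context.
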
!check DCR-RR/Odijk2010 -follow up
!Cf. Erhard Hinrichs 2009

A final relevant initiative is the \texttt{Vocabulary Alignment Service}, which is being developed and run within the Dutch program CATCH\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}} and serves as a neutral manager and provider of controlled vocabularies. There are plans to reuse or enhance this service for the needs of the CLARIN project.

\noindent
All these components are running services that this work builds upon directly.

This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation differs
from the traditional methods of schema matching, which try to establish pairwise alignments between schemas only after they have been created and published. % -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.

Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this novel mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules are described in chapter \ref{method}.

\subsection{CMDI - Exploitation side}
Metadata complying with the CMD framework is being created by a growing number of institutions by various means: automatic transformation from legacy data, or authoring of new metadata records with the help of one of the metadata editors (TODO: cite: Arbil, NALIDA, ). The CMD infrastructure requires content providers to publish their metadata via the OAI-PMH protocol and to announce their OAI-PMH endpoints. These are harvested daily by a dedicated CLARIN harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}. The harvested data is validated against the schemas \todo{What about Normalization?} and made available in packaged datasets. These are fetched by the exploitation-side components, which index the metadata records and make them available for searching and browsing.

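To make the harvesting step more concrete, the following Python sketch builds the URL of an OAI-PMH \texttt{ListRecords} request. It is not the code of the CLARIN harvester. The endpoint \texttt{http://example.org/oai} is a placeholder, and the metadata prefix \texttt{cmdi} is an assumption that has to be checked against what a given provider announces.

```python
from urllib.parse import urlencode


def list_records_url(endpoint, metadata_prefix="cmdi", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, no other argument besides the verb is sent.
    The default metadata prefix "cmdi" is an assumption of this sketch.
    """
    params = {"verb": "ListRecords"}
    if resumption_token is not None:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder endpoint, not a real content provider.
    print(list_records_url("http://example.org/oai"))
```

A harvester would issue the first request with a metadata prefix and keep requesting with the returned resumption token until the provider stops returning one.
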
\begin{figure*}[!ht]
\includegraphics[width=1\textwidth]{images/CMDingestion_woVAS}
\caption{Within CMDI, metadata is harvested from content providers via OAI-PMH and made available to consumers/users by exploitation side components}
\end{figure*}

The first stable and publicly available application providing access to the collected metadata of CMDI has been the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, developed within the CLARIN project. This application operates on the same collection of data as is discussed in this work; however, it employs a faceted search, manually mapping the appropriate metadata fields from the different schemas to 10? fixed facets. The underlying search engine is the widely used full-text search engine Apache Solr\footnote{\url{http://lucene.apache.org/solr/}}. Although this is a very reductionist approach, it is certainly a great starting point, offering a core set of categories together with an initial set of category mappings.

More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\footnote{\url{http://www.meertens.knaw.nl/cmdi/search/}}. It too is based on Apache Solr and provides a faceted search, but with both a substantially more sophisticated indexing process and a more sophisticated search interface. \todo { describe indexing and search}
\todo { add citation}

Finally, there is the \emph{Metadata Repository}, which aims to collect all the harvested metadata descriptions from CLARIN centers,
and the \emph{Metadata Service}, which provides search access to this body of data. As such, the Metadata Service is the primary application to use Semantic Mapping, optionally expanding user queries before issuing a search in the Metadata Repository \cite{Durco2011}.

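The optional query expansion can be pictured with the following Python sketch. It does not reproduce the interface of the Metadata Service or of the Semantic Mapping module; the field names and the lookup table standing in for the mapping result are hypothetical.

```python
# Hypothetical result of a semantic-mapping lookup: field -> equivalent fields.
EQUIVALENT_FIELDS = {
    "ProfileA.Language.Name": ["ProfileB.Lang", "dc.language"],
}


def expand_query(field, value, mapping, enabled=True):
    """Return the (field, value) pairs to search.

    The original pair always comes first; if expansion is enabled,
    one pair per equivalent field is appended.
    """
    pairs = [(field, value)]
    if enabled:
        pairs.extend((other, value) for other in mapping.get(field, []))
    return pairs


if __name__ == "__main__":
    for pair in expand_query("ProfileA.Language.Name", "Dutch", EQUIVALENT_FIELDS):
        print(pair)
```

Keeping the expansion behind a flag mirrors the optional character of the mechanism: with expansion switched off, the query is passed on unchanged.
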