Changeset 4117 for SMC4LRT/chapters/Introduction.tex
- Timestamp: 12/01/13 19:04:51
- File: 1 edited
SMC4LRT/chapters/Introduction.tex
--- SMC4LRT/chapters/Introduction.tex (r3776)
+++ SMC4LRT/chapters/Introduction.tex (r4117)
@@ -4,11 +4,11 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
-\section{Motivation / problem statement}
+\section{Motivation / Problem Statement}
 
 While in the Digital Libraries community a consolidation already took place and global federated networks of digital library repositories are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardization and integration efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types combined with project-specific needs. (Chapter \ref{ch:data} analyses the disparity in the data domain.)
 
-This situation has been identified by the community and numerous standardization initiatives had been undertaken. The process has gained a new momentum thanks to large framework programmes introduced by the European Commission aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies (LRT) more easily available to scholars by means of a common harmonized architecture. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way.
+This situation has been identified by the community and numerous standardization initiatives had been undertaken. The process has gained a new momentum thanks to large framework programmes introduced by the European Commission aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies (LRT) more easily available to scholars by means of a common harmonized architecture. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (CMDI, cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way.
 
-This work discusses one module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} -- dedicated to overcome or at least ease the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of imposing one common description schema for all resources.
+This work discusses one module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} (SMC) -- dedicated to overcome or at least ease the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of imposing one common description schema for all resources.
 
 \section{Main Goal}
@@ -40,5 +40,5 @@
 Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were weaved into the underlying meta-model as well as all the modules of the infrastructure. The task of the crosswalk service -- the primary part of the SMC module -- is to collect the relevant information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata schemas that can serve as basis for concept-based search.
 
-Thus, the goal is not primarily to produce the crosswalks but rather to develop the service serving existing ones.
+Thus, the goal is not primarily to define new crosswalks but rather to develop a service serving existing ones.
 
 \subsubsection*{Concept-based query expansion}
@@ -48,5 +48,5 @@
 \paragraph{Example}
 Confronted with a user query searching in the notorious \concept{dublincore:title} the query has to be \emph{expanded} to
-all the semantically near fields (\emph{concept cluster}), that are however labelled (or even structured) differently in other schemas like:
+all the semantically near fields (\emph{concept cluster}) that are however, labelled (or even structured) differently in other schemas like:
 
 \begin{quote}
@@ -54,5 +54,5 @@
 \end{quote}
 
-The expansion cannot be solved by simple string matching, as there are other fields labeled with the same (sub)strings but with different semantics, that shouldn't be considered:
+The expansion cannot be solved by simple string matching, as there are other fields labelled with the same (sub)strings but with different semantics that shouldn't be considered:
 
 \begin{quote}
@@ -62,5 +62,5 @@
 \subsubsection*{Semantic interpretation}
 
-The problem of different labels for semantically similar or even identical entities is even more so virulent on the level of individual values in the fields of the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that yet cannot be explicitly exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to propose a mechanism to map (string) values in selected fields to entities defined in corresponding vocabularies.
+The problem of different labels for semantically similar or even identical entities is even more so virulent on the level of individual values in the fields of the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that yet cannot be explicitly exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities. (As the evidence in the metadata records collected within CMDI shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to propose a mechanism to map (string) values in selected fields to entities defined in corresponding vocabularies.
 
 \subsubsection*{Ontology-driven data exploration}
@@ -75,5 +75,5 @@
 
 \section{Method}
-We start with examining the existing data and with the description of the existing infrastructure in which this work is embedded.
+We start with examining the existing data and with the description of the existing infrastructure, in which this work is embedded.
 
 Building on this groundwork, in accordance with the first subgoal, we lay out the design of the service for handling crosswalks and concept-based query expansion. We describe the workflow, the central methods and the role of the module relative to other parts of the infrastructure.
@@ -90,5 +90,5 @@
 Once the dataset is expressed in RDF, it can be exposed via a semantic web application and published as another nucleus of \emph{Linked Open Data} in the global \emph{Web Of Data}.
 
-A separate evaluation of the usability of the proposed semantic search solution is indicated, examining the user interaction with and display of the relevant additional information in the user search interface, however this issue can only be tackled marginally and will have to be outsourced into future work.
+A separate evaluation of the usability of the proposed semantic search solution is indicated, examining the user interaction with and display of the relevant additional information in the user search interface, however, this issue can only be tackled marginally and will have to be outsourced into future work.
 
 \section{Expected Results}
@@ -108,5 +108,5 @@
 \end{description}
 
-\section{Structure of the work}
+\section{Structure of the Work}
 The work starts with examining the state of the art work in the two fields language resources and technology and semantic web technologies in chapter \ref{ch:lit}. In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components of the infrastructure underlying this work.
 
@@ -116,5 +116,5 @@
 The results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.
 
-The auxiliary material accompanying the work is found in the appendix. After the administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work, full specifications of the used data models (\ref{ch:data-model-ref} and data samples (\ref{ch:cmd-sample}) are listed for reference, as well as the developer and user documentation for the technical solution of this work, the SMC module (\ref{ch:smc-docs}).
+The auxiliary material accompanying the work is found in the appendix. After the administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work, full specifications of the used data models (\ref{ch:data-model-ref}) and data samples (\ref{ch:cmd-sample}) are listed for reference, as well as the developer and user documentation for the technical solution of this work, the SMC module (\ref{ch:smc-docs}).
 
 