Changeset 3234
- Timestamp: 08/05/13 13:25:21 (11 years ago)
- Location: SMC4LRT/chapters
- Files: 2 edited
--- SMC4LRT/chapters/Infrastructure.tex (revision 3204)
+++ SMC4LRT/chapters/Infrastructure.tex (revision 3234)
@@ -1,4 +1,4 @@
 \chapter{Underlying infrastructure}
-\label{ch:components}
+\label{ch:infra}
 
 
--- SMC4LRT/chapters/Introduction.tex (revision 3204)
+++ SMC4LRT/chapters/Introduction.tex (revision 3234)
@@ -4,67 +4,112 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
-\todocode{install older python (2.5?) to be able to install dot2tex - transforming dot files to nicer pgf formatted graphs}\furl{http://dot2tex.googlecode.com/files/dot2tex-2.8.7.zip}\furl{file:/C:/Users/m/2kb/tex/dot2tex-2.8.7/}
-
-
 \section{Motivation / problem statement}
 
-While in the Digital Libraries community a consolidation generally already happened and global federated networks of digital library repositories are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardizing efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming for one from the wide range of resource types combined with project-specific needs. (See chapter \ref{ch:data} for analysis of the data disparity in the domain)
+While in the Digital Libraries community a consolidation generally already happened and global federated networks of digital library repositories are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardization and integration efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types combined with project-specific needs. (chapter \ref{ch:data} analyses the disparity in the data domain)
 
-This situation has been identified by the community and multiple standardization initiatives had been conducted/undertaken. The process has gained a new momentum thanks to large research infrastructure programmes introduced by the European Commission, aimed at supporting research communities developing common large-scale pan-european infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies more easily available to scholars, by providing a common infrastructure. One core pillar of this infrastructure is a common framework for resource descriptions (metadata) the Component Metadata Framework (cf. \ref{def:cmdi}). This work discusses the component within the Component Metadata Infrastructure concerned with /dedicated to overcoming the heterogenity of the resource descriptions.
+This situation has been identified by the community and multiple standardization initiatives had been conducted/undertaken. The process has gained a new momentum thanks to large research infrastructure programmes introduced by the European Commission, aimed at fostering the development of common large-scale international infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies more easily available to scholars, by providing a common harmonized architecture for accessing and working with LRT. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:cmdi})
+-- a distributed system consisting of multiple interconnected applications aimed at creating and providing metadata for lLRT in a coherent harmonized way.
 
+This work discusses a module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} -- dedicated to overcome or at least ease the semantic interoperability problem stemming from the heterogenity of the resource descriptions, without the reductionist approach of trying to impose one common description schema for all resources.
 
 \section{Main Goal}
 
-The primary goal of this work is to enhance search functionality over a \emph{large heterogeneous collection of metadata descriptions} of Language Resources and Technology (LRT). By applying semantic web technology the user shall be given both better recall through \emph{query expansion} based on related categories/concepts and new means of \emph{exploring the dataset} via ontology-driven browsing.
+The primary goal of this work is to \emph{\textbf{enhance search functionality} over a large heterogeneous collection of resource descriptions} in the field of Language Resources and Technology (LRT), henceforth referred to as \emph{semantic search} , distincting it from the necessary underlying processing, referred to as \emph{semantic mapping}.
 
-This main goal can be broken down along the three meanings of the term ``mapping'':
+The -- notoriously polysemic -- term ``mapping'' can have three different meanings within this work,
+that also translate into three corresponding subgoals:
 
 \begin{description}
-\item[crosswalk] establish links between crosswalks between metadata formats
+\item[crosswalk] link related fields in different metadata formats
 \item[interpret] translate string labels in field values to semantic entities
-\item[visualize] provide a visualization of the data.
+\item[visualize] provide appropriate means to explore the domain data.
 \end{description}
 
-Alternatively/ that allows query expansion by providing mappings between search indexes. This enables semantic search, ultimately increasing the recall when searching in metadata collections. The module builds on the Data Category Registry and Component Metadata Framework that are part of CMDI.
+The work can further be divided along the schema / instance duality/dimension. Figure \ref{fig:master_outline} sketches the goals / conceptual space of this thesis.
 
+%\includegraphics[width=\unitlength]{images/master_outline.eps}
+\label{fig:master_outline}
+\input{images/master_outline.eps_tex}
 
-Following two examples for better illustration. First a concept-based query expansion:
-Confronted with a user query: \texttt{Actor.Name = Sue} and knowing that \texttt{Actor} is synonym to \texttt{Person} and \texttt{Name} is synonym to \texttt{FullName} the expanded query could look like:
+\subsubsection*{Crosswalks}
+Goal is not primarily to produce the crosswalks but rather to develop the service serving them.
+
+???
+
+While this may seem a rather trivial task, it is not if we consider the heterogeneity and complexity of the dataset,
+further complicated by the fact, that this shall be community-driven process, without a central authority defining the relations
+and that there may be even need for different relation sets for different tasks. In fact, a number of modules of the discussed infrastructure are dedicated to overcoming the semantic interoperability problem.
+
+\subsubsection*{Concept-based query expansion}
+
+Once the crosswalks are available, they can be used to expand/translate user queries, to match related fields across heterogeneous metadata formats, resulting in higher recall.
+
+\paragraph{Example}
+Confronted with a user query searching in the notorious \concept{dublincore:title} the query has to be expanded to
+all the semantically near fields (concept cluster), that are however labelled (or even structured) differently in other formats like
+
+\begin{quote}
+\concept{resourceTitle, BookTitle, tei:titleStmt, Corpus/GeneralInfo/Name}
+\end{quote}
+
+but probably not to other fields, using same (sub)strings for the field labels
+but with different semantics, like:
+
 \begin{quote}
-\texttt{Actor.Name = Sue OR Actor.FullName = Sue OR \\ Person.Name = Sue OR Person.FullName = Sue}
+\concept{Project/Title, Organisation/Name, Country/Name}
 \end{quote}
-And second, an ontology-driven search: Starting from a list of topics the user can browse an ontology to find institutions concerned with those topics and retrieve a union of resources for the resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external linked-in semantic resources.
 
-Such \textbf{semantic search} functionality requires a preprocessing step, that produces the underlying linkage both between categories/concepts and on the instance level. We refer to this task as \textbf{semantic mapping}, that shall be realized by corresponding \texttt{Semantic Mapping Component}. In this work the focus lies on the method itself -- expressed in the specification and operationalized in the (prototypical) implementation of the component -- rather than trying to establish a final, accomplished alignment. Although a tentative, na\"ive mapping on a subset of the data will be proposed, this will be mainly used for evaluation and shall serve as basis for discussion with domain experts aimed at creating the actual sensible mappings usable for real tasks.
+\subsubsection*{Semantic interpretation}
 
-In fact, due to the great diversity of resources and research tasks, a ``final'' complete alignment does not seem achievable at all. Therefore also the focus shall be on ``soft'' dynamic mapping, i.e. to enable the users to adapt the mapping or apply different mappings depending on their current task or research question essentially being able to actively manipulate the recall/precision ratio of the search results. This entails an examination of user interaction with and visualization of the relevant additional information in the user search interface. However this would open doors to a whole new (to this work) field of usability engineering and can be treated here only marginally.
+The problem of different labels for semantically similar or even identical things is even more so virulent on the level of individual values in the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that yet cannot be explicitly/exhaustively enumerated. This leads to inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to map (string) values in selected fields to entities defined in corresponding vocabularies.
+
+\subsubsection*{Ontology-driven search / data exploration}
+
+By applying semantic web technologies, the user will be given new means of \emph{exploring the dataset} through semantic resources (ontology-driven search/browsing/exploration).
+
+\paragraph{Example}
+Ontology-driven search: Starting from a list of topics the user can browse an ontology to find institutions concerned with those topics and retrieve a union of resources for the resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external linked-in semantic resources.
+
+\subsubsection*{Visualization}
+Given the large, heterogeneous and complex dataset, it seems indispensable to equip the user with advanced means for exploration of and interaction with it. Hence this subgoal aiming at exploring ways of visualizing the data at hand.
 
 \section{Method}
-We start with examining the existing data and describing the evolving infrastructure in which the components are to be embedded. Then we formulate the function of \textbf{Semantic Search} distinguishing between the concept level -- using semantic relations between concepts or categories for better retrieval -- and the instances level -- allowing the user to explore the primary data collection via semantic resources (ontologies, vocabularies).
+The primary concern of this work is the integrative effort, i.e. bringing together existing pieces (resources, components and methods). We start with examining the existing data and the description of the evolving infrastructure in which this work is embedded.
 
-Subsequently we introduce the underlying \textbf{Semantic Mapping Component} again distinguishing the two levels - concepts and instances. We describe the workflow and the central methods, building upon the existing pieces of the infrastructure (See \textit{Infrastructure Components} in \ref{SotA} ). A special focus will be put on the examination of the feasibility of employing ontology mapping and alignment techniques and tools for the creation of the mappings.
+Building on this groundwork, in accordance with the first subgoal, we lay out the design of the service for handling crosswalks and concept-based query expansion. We describe the workflow, the central methods and the role of the module relative to other parts of the infrastructure.
 
-In the practical part - processing the data - a necessary prerequisite is the dataset being expressed in RDF.
-Independently, starting from a survey of existing semantic resources (ontologies, vocabularies), we identify an intial set of relevant
-ones. These will then be used in the exercise of mapping the literal values in the by then RDF-converted metadata descriptions onto externally defined entities, with the goal of interlinking the dataset with external resources (see \textit{Linked Data} in \ref{SotA}).
+Subsequently, we explore the ways of integrating this service into exploitation tools (metadata search engines), to enhance search/retrieval through the use of semantic relations between concepts or categories.
 
-Finally, in a prototypical implementation of the two components we want to deliver a proof of the concept, supported by an evaluation
-in which we apply a set of test queries and compare a traditional search with a semantically expanded query in terms of recall/precision measures. A separate evaluation of the usability of the Semantic Search component is indicated, however this issue can only be tackled marginally and will have to be outsourced into future work.
+This theoretical part will be accompanied by a prototypical implementation as proof of concept.
 
-\begin{itemize}
-\item a) define/use semantic relations between categories (RelationRegistry)
-\item b) employ ontological resources to enhance search in the dataset (SemanticSearch)
-\item c) specify a translation instructions for expressing dataset in rdf (LinkedData)
-\end{itemize}
+In an evaluation phase, we apply a set of test queries and compare a traditional search with a semantically expanded query in terms of recall/precision measures.
+
+In this work the focus lies on the method itself -- expressed in the specification and operationalized in the (prototypical) implementation of the module -- rather than trying to establish final, accomplished alignment of the schemas. Although a tentative mapping on a subset of the data will be proposed, this will be mainly used for evaluation and shall serve as basis for discussion with domain experts aimed at creating further, more comprehensive mappings.
+
+In fact, due to the great diversity of resources and research tasks, a ``final'' complete alignment does not seem achievable at all. Therefore also the focus shall be on ``soft'' dynamic mapping, i.e. to enable the users to adapt the mapping or apply different mappings depending on their current task or research question essentially being able to actively influence the recall/precision ratio of the search results.
+
+\begin{note}
+A special focus will be put on the examination of the feasibility of employing ontology mapping and alignment techniques and tools for the creation of the mappings.
+
+especially the application of techniques from ontology mapping to the domain-specific data collection (the domain of LRT).
+\end{note}
+
+Serving the second subgoal, semantic interpretation on the instance level, we will propose the expression of all of the domain data (from meta-model specification to instances) in RDF, linking to corresponding entities in appropriate external
+semantic resources (controlled vocabularies, ontologies).
+Once the dataset is expressed in RDF, it can be exposed via a semantic web application and publicized as another nucleus of \emph{Linked Open Data} in the global \emph{Web Of Data}.
+
+A separate usability evaluation of the semantic search is indicated examining the user interaction with and display of the relevant additional information in the user search interface, however this issue can only be tackled marginally and will have to be outsourced into future work.
 
 \section{Expected Results}
-The primary concern of this work is the integrative effort, i.e. putting together existing pieces (resources, components and methods) especially the application of techniques from ontology mapping to the domain-specific data collection (the domain of LRT). Thus the main result of this work will be the \emph{specification} of the two components \texttt{Semantic Search} and the underlying \texttt{Semantic Mapping}.
-This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components and the results and findings of the \emph{evaluation}.
 
-One promising by-product of the work will be the original dataset expressed as RDF with links into existing external resources (ontologies, knowledgebases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\furl{http://linkeddata.org/} in the \emph{Web of Data}.
+The main result of this work will be the \emph{specification} of the two modules \texttt{Semantic Search} and the underlying \texttt{Semantic Mapping}.
+This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components
+and the results and findings of the \emph{evaluation}.
 
+Another result of the work will be the original dataset expressed as RDF interlinked with existing external resources (ontologies, knowledge bases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\furl{http://linkeddata.org/} in the \emph{Web of Data}.
 
 \begin{description}
-\item [Specification] definition of the mapping mechanism
+\item [Specification Semantic Mapping] design of the mapping mechanism
+\item [Specification Semantic Search] design of the query expansion and integration with search engines
 \item [Prototype] proof of concept implementation
 \item [Evaluation] evaluation results of querying the dataset comparing traditional search and semantic search
@@ -73,11 +118,11 @@
 
 \section{Structure of the work}
-The work starts with examining the state of the art work in the two fields language resources and technology and semantic web technologies in chapter \ref{ch:lit}. In chapter \ref{cd:data} we analyze the situation in the data domain of LRT metadata and
-in chapter \ref{ch:infra} we discuss the individual software components /modules /services of the infrastructure underlying this work.
+The work starts with examining the state of the art work in the two fields language resources and technology and semantic web technologies in chapter \ref{ch:lit}, followed by administrative chapter \ref{ch:def} explaining the terms and abbreviations used in the work.
 
-The main part of the work is found in chapters \ref{ch:design}, \ref{ch:cmd2rdf} and \ref{ch:implementation}
+In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components /modules /services of the infrastructure underlying this work.
 
-The findings and results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.
+The main part of the work is found in chapters \ref{ch:design}, \ref{ch:implementation} and \ref{ch:cmd2rdf} laying out the design of the software module, the proposal how to modell the data in RDF and the possibilities of visualization respectively.
 
+The evaluation and the results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.
 
 \section{Keywords}