Changeset 2671
- Timestamp: 03/10/13 21:06:32 (11 years ago)
- Location: SMC4LRT
- Files: 3 added, 5 edited
SMC4LRT/Evaluation.tex
--- SMC4LRT/Evaluation.tex (r2669)
+++ SMC4LRT/Evaluation.tex (r2671)
@@ -1,3 +1,19 @@
 \section{Evaluation}
+
+\subsection{Use Cases}
+
+\begin{itemize}
+
+\item MD Search employing Semantic Mapping
+\item MD Search employing Fuzzy Search
+\item Visualization of the Results - ?
+\end{itemize}
+
+A trivial example for a concept-based query expansion:
+Confronted with a user query: \texttt{Actor.Name = Sue} and knowing that \texttt{Actor} is equivalent or similar to \texttt{Person} and \texttt{Name} is synonym to \texttt{FullName} the expanded query could look like:
+\texttt{Actor.Name = Sue OR Actor.FullName = Sue OR Person.Name = Sue OR Person.FullName= is Sue}
+
+Another example concerning instance mapping: the user looking for all resource produced by or linked to a given institution, does not have to guess or care for various spellings of the name of the institution used in the description of the resources, but rather can browse through a controlled vocabulary of institutions and see all the resources of given institution. While this could be achieved by simple normalizing of the literal-values (and indeed that definitely has to be one processing step), the linking to an ontology enables to user to also continue browsing the ontology to find institutions that are related to the original institution by means of being concerned with similar topics and retrieve a union of resources for such resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset.
+
 
 \subsection{Research Questions }
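Note on the query-expansion example added to Evaluation.tex: the expansion can be read as a cross product over the alternatives known for each step of the index path. The Python sketch below is illustrative only. The EQUIVALENTS table and the expand_query function are hypothetical names, not part of the SMC code base, and the hard-coded dictionary stands in for mappings that would presumably be obtained from the registries.

```python
from itertools import product

# Hypothetical equivalence table (assumption): each term maps to the list of
# terms considered equivalent or similar, including the term itself.
EQUIVALENTS = {
    "Actor": ["Actor", "Person"],
    "Name": ["Name", "FullName"],
}


def expand_query(index, value, equivalents=EQUIVALENTS):
    """Expand a dotted index such as 'Actor.Name' into an OR-joined query."""
    steps = index.split(".")
    # For every path step use its known alternatives, or the step itself.
    alternatives = [equivalents.get(step, [step]) for step in steps]
    clauses = [
        "{} = {}".format(".".join(combo), value)
        for combo in product(*alternatives)
    ]
    return " OR ".join(clauses)


expanded = expand_query("Actor.Name", "Sue")
print(expanded)
```

An index without any known equivalents is passed through unchanged, so the expansion never narrows the original query.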
SMC4LRT/Introduction.tex
--- SMC4LRT/Introduction.tex (r2669)
+++ SMC4LRT/Introduction.tex (r2671)
@@ -6,4 +6,7 @@
 
 This work proposes a component that shall enhance search functionality over a \emph{large heterogeneous collection of metadata descriptions} of Language Resources and Technology (LRT). By applying semantic web technology the user shall be given both better recall through \emph{query expansion} based on related categories/concepts and new means of \emph{exploring the dataset} via ontology-driven browsing.
+
+Alternatively/ that allows query expansion by providing mappings between search indexes. This enables semantic search, ultimately increasing the recall when searching in metadata collections. The module builds on the Data Category Registry and Component Metadata Framework that are part of CMDI.
+
 
 Following two examples for better illustration. First a concept-based query expansion:
@@ -27,5 +30,5 @@
 ones. These will then be used in the exercise of mapping the literal values in the by then RDF-converted metadata descriptions onto externally defined entities, with the goal of interlinking the dataset with external resources (see \textit{Linked Data} in \ref{SotA}).
 
-Finally in a prototypical implementation of the two components we want to deliver a proof of the concept, supported by an evaluation
+Finally, in a prototypical implementation of the two components we want to deliver a proof of the concept, supported by an evaluation
 in which we apply a set of test queries and compare a traditional search with a semantically expanded query in terms of recall/precision indicators. A separate evaluation of the usability of the Semantic Search component is indicated, however this issue can only be tackled marginally and will have to be outsourced into future work.
 
@@ -60,5 +63,2 @@
 
 Ontology Visualization
-
-Federated Search, Distributed Content Search
-(ILS - Integrated Library Systems)
SMC4LRT/Literature.tex
--- SMC4LRT/Literature.tex (r2669)
+++ SMC4LRT/Literature.tex (r2671)
@@ -4,13 +4,8 @@
 
 \subsection*{Infrastructure Components}
-There are multiple relevant activities being carried out in the context of research infrastructure initiatives for LRT. The most relevant ongoing effort is the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, being developed within the CLARIN project. This application operates on roughly the same collection of data as is discussed in this work, however it employs a faceted search, mapping manually the appropriate metadata fields from the different schemas to 8 fixed facets. Although this is a very reductionist approach it is certainly a great starting point offering a core set of categories together with an initial set of category mappings.
+In recent years, multiple large-scale initiatives have been set out to combat the fragmented nature of the language resources landscape in general and the metadata interoperability problems in particular. A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\footnote{\url{http://www.clarin.eu/cmdi}} \cite{Broeder+2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders+2009,Broeder+2010}.
 
-\texttt{Component Registry} and \texttt{ISOcat}\footnote{\url{http://www.isocat.org/}}
-are two integral components of the \textit{CLARIN Metadata Infrastructure} maintaining the normative information. Especially \texttt{ISOcat} -- the ISO-standardized Data Category Registry for registering and maintaining \texttt{Data Categories} as globally agreed upon incarnations of concepts in the domain of discourse -- is the definitive primary reference vocabulary \cite{Broeder2010,ISO12620:2009}. A tightly related work is that on the so called \texttt{Relation Registry}, a separate component that allows to define arbitrary relations between data categories, however this activity is rather in an early prototypical phase.
+Individual components of this infrastructure will be described in more detail in the section \ref{components}.
 
-And a last relevant intiative to mention is that of a \texttt{Vocabulary Alignment Service} being developed and run within the Dutch program CATCH\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}, which serves as a neutral manager and provider of controlled vocabularies. There are plans to reuse or enhance this service for the needs of the CLARIN project.
-
-\noindent
-All these components are running services, that this work shall directly build upon.
 
 \subsection*{LRT Resources}
SMC4LRT/Outline.tex
--- SMC4LRT/Outline.tex (r2669)
+++ SMC4LRT/Outline.tex (r2671)
@@ -61,5 +61,5 @@
 \usepackage[nottoc,notlof,notlot]{tocbibind} % Put the bibliography in the ToC
 %\usepackage[titles,subfigure]{tocloft} % Alter the style of the Table of Contents
-%\renewcommand{\cftsecfont}{\rmfamily\mdseries\upshape}
+%\renewcommand{\cftsecfont}{\rmfamily\mdseries\upshape}
 %\renewcommand{\cftsecpagefont}{\rmfamily\mdseries\upshape} % No bold!
 
@@ -89,5 +89,5 @@
 \begin{description}
 \item[Concept] sense, idea, philosophical problem, which we don't need to discuss here. For our purposes we say: Basic "entity" in an ontology? that of what an ontology is build
-\item[Ontology] "a explicit specification of a conceptualization" [cite!], but for us mainly a collection of concepts as opposed to lexicon, which is a collection of words.
+\item[Ontology] "an explicit specification of a conceptualization" [cite!], but for us mainly a collection of concepts as opposed to lexicon, which is a collection of words.
 \item[Word] a lexical unit, a word in a language, something that has a surface Realization (writtenForm) and is a carrier of sense. so a Relation holds: hasSense(Word, Concept)
 \item[Lexicon] a collection of words, a (lexical) vocabulary
@@ -104,135 +104,9 @@
 \end{description}
 
-\section{Analysis}
+\include{Data}
 
-\subsection{Data landscape}
+\include{Infrastructure}
 
-Describe situation regarding the datasets and formats
-
-collections, profiles/Terms, ResourceTypes!
-
-DC, OLAC,
-ISLE/IMDI, CHILDES, TEI, EAF!
-(CES/XCES)
-
-\subsection{Infrastructure}
-
-CMDI \cite{Broeder2010}
-
-
-\subsection{Ontologies, Controlled Vocabularies, Knowledge Organizing Systems}
-
-
-\subsubsection{Classification Schemes, Taxonomies }
-LCSH, DDC
-
-
-\subsubsection{Other controlled Vocabularies}
-Tagsets: STTS
-Language codes ISO-639-1
-
-\subsubsection{Domain Ontologies, Vocabularies}
-Organization-Lists
-LT-World !?
-
-
-\subsection{Use Cases}
-
-\begin{itemize}
-
-\item MD Search employing Semantic Mapping
-\item MD Search employing Fuzzy Search
-\item Content Search
-\item Combined MEtadata Content Search
-\item Visualization of the Results - charts on facets/dimensions
-
-\item Create and publish Virtual Collection based on complex Search (intensional/extensional)
-\item Let Create ad-hoc corpus
-\end{itemize}
-
-A trivial example for a concept-based query expansion:
-Confronted with a user query: \texttt{Actor.Name = Sue} and knowing that \texttt{Actor} is equivalent or similar to \texttt{Person} and \texttt{Name} is synonym to \texttt{FullName} the expanded query could look like:
-\texttt{Actor.Name = Sue OR Actor.FullName = Sue OR Person.Name = Sue OR Person.FullName= is Sue}
-
-Another example concerning instance mapping: the user looking for all resource produced by or linked to a given institution, does not have to guess or care for various spellings of the name of the institution used in the description of the resources, but rather can browse through a controlled vocabulary of institutions and see all the resources of given institution. While this could be achieved by simple normalizing of the literal-values (and indeed that definitely has to be one processing step), the linking to an ontology, enables to user to also continue browsing the ontology to find institutions that are related to the original institution by means of being concerned with similar topics and retrieve a union of resources for such resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset.
-
-\section{Semantic Mapping}
-
-
-\subsection{Profiles to Data Categories}
-CMD:Profile.Comp.Elem -> DatCat
-
-
-\subsection{Semantic Relations between (Data)Categories}
-
-Relation Registry
-
-!check DCR-RR/Odijk2010 -follow up
-!Cf. Erhard Hinrichs 2009
-
-
-\subsection{Mapping from strings to Entities}
-
-Based on the textual values in the Metadata-descriptions find matching entities in selected Ontologies.
-
-Identify related ontologies:
-LT-World \cite{Joerg2010}
-
-task:
-\begin{enumerate}
-\item express MDRecords in RDF
-\item identify related ontologies/vocabularies (category -> vocabulary)
-\item implement (reuse) a lookup/mapping function (Vocabulary Alignement Service? CATCH-PLUS?)
-
-\fbox{ function lookup: Category x String -> ConceptualDomain}
-
-Normally this would be served by dedicated controlled vocabularies, but expect also some string-normalizing preprocessing etc.
-\end{enumerate}
-
-
-
-\subsection{Semantic Search}
-
-Main purpose for the undertaking described in previous two chapters (mapping of concepts and entities) is to enhance the search capabilities of the MDService serving the Metadata/Resources-data. Namely to enhance it by employing ontological resources.
-Mainly this enhancement shall mean, that the user can access the data indirectly by browsing one or multiple ontologies,
-with which the data will then be linked. These could be for example ontologies of Organizations and Projects.
-
-In this section we want to explore, how this shall be accomplished, ie how to bring the enhanced capabilities to the user.
-Crucial aspect is the question how to deal with the even greater amount of information in a user-friendly way, ie how to prevent overwhelming, intimidating or frustrating the user.
-
-Semi-transparently means, that primarily the semantic mapping shall integrate seamlessly in the interaction with the service, but it shall "explain" - offer enough information - on demand, for the user to understand its role and also being able manipulate easily.
-
-?
-Facets
-Controlled Vocabularies
-Synonym Expansion (via TermExtraction(ContentSet))
-
-\subsection{Linked Data - Express dataset in RDF}
-
-Partly as by-product of the entities-mapping effort we will get the metadata-description rendered in RDF, linked with
-So theoretically we then only need to provide them "on the web", to make them a nucleus of the LinkedData-Cloud.
-
-Practically this won't be that straight-forward as the mapping to entities will be a hell of a work.
-But once that is solved, or for the subsets that it is solved, the publication of that data on the "SemanticWeb" should be easy.
-
-Technical aspects (RDF-store?) / interface (ontology browser?)
-
-defining the Mapping:
-\begin{enumerate}
-\item convert to RDF
-translate: MDREcord -> [\#mdrecord \#property literal]
-\item map: \#mdrecord \#property literal -> [\#mdrecord \#property \#entity]
-\end{enumerate}
-
-\subsection{Content/Annotation}
-AF + DCR + RR
-
-
-\subsection{Visualization}
-Landscape, Treemap, SOM
-
-Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf
-
 \include{SMC}
 
 \include{System}
@@ -242,4 +116,9 @@
 
 \section{Conclusions and Future Work}
+
+The Semantic Mapping module is based on the DCR and CMD framework and is being developed as a separate service on the side of CLARIN Metadata Service, its primary consuming service, but shall be equally usable by other applications.
+
+Further work is needed on more complex types of response (similarity ratio, relation types) and also on the interaction with Metadata Service to find the optimal way of providing the features of semantic mapping and query expansion as semantic search within the search user-interface.
+
 
 \section{Questions, Remarks}
@@ -252,6 +131,6 @@
 
 
-\bibliographystyle{ieee}
-\bibliography{../../../2bib/lingua,../../../2bib/ontolingua}
+\bibliographystyle{ieeetr}
+\bibliography{../../2bib/lingua,../../2bib/ontolingua,../../2bib/smc4lrt,../../2bib/semweb}
 
 
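Note on the text moved out of Outline.tex: the lookup function (Category x String -> ConceptualDomain) and the two-step mapping from literal triples to entity triples could be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's implementation: the VOCABULARIES table, the entity identifier and the function names normalize, lookup and map_record are all hypothetical, and a real lookup would presumably query a vocabulary service rather than an in-memory dictionary.

```python
import unicodedata

# Hypothetical controlled vocabulary (assumption):
# category -> {normalized label: entity identifier}.
VOCABULARIES = {
    "Organisation": {
        "example institute": "org:example-institute",
        "example inst.": "org:example-institute",
    },
}


def normalize(text):
    """String-normalizing preprocessing: strip accents, fold case,
    collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(no_accents.casefold().split())


def lookup(category, literal, vocabularies=VOCABULARIES):
    """Category x String -> entity identifier, or None if nothing matches."""
    return vocabularies.get(category, {}).get(normalize(literal))


def map_record(record_id, fields):
    """Turn a metadata record into (record, property, object) triples.

    The object is an entity identifier when the lookup succeeds and the
    original literal otherwise.
    """
    triples = []
    for prop, literal in fields.items():
        entity = lookup(prop, literal)
        triples.append(
            (record_id, prop, entity if entity is not None else literal)
        )
    return triples


triples = map_record(
    "rec1", {"Organisation": "Example  INSTITUTE", "Title": "Corpus A"}
)
print(triples)
```

Keeping the unmatched literal in the triple means a record is still fully represented when only some of its values can be mapped.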
SMC4LRT/System.tex
--- SMC4LRT/System.tex (r2669)
+++ SMC4LRT/System.tex (r2671)
@@ -1,4 +1,5 @@
-\section{System Design}
-SOA
+\section{?? System}
+SOA?
+
 
 \subsection{DataModel}
@@ -9,26 +10,22 @@
 RDF
 
-\subsection{Architecture}
-
-Makes use of mulitple Components of the established infrastructure (CLARIN ) \cite{Varadi2008}, \cite{Broeder2010}:
-
-\begin{itemize}
-\item Data Category REgistry,
-\item Relation Registry
-\item Component Registry
-\item Vocabulary Alignement Service (OpenSKOS)
-\item SchemaParser
-\end{itemize}
-merging the pieces of information provided by those,
-offering them semi-transaprently to the user (or application) on the consumption side.
+\subsection{Query Language}
+CQL?
 
 
-\subsection{CMDI}
+\subsection*{Implementation}
 
-MDBrowser
-MDService
+The core function of the SMC is being implemented as a set of XSL-stylesheets, with auxiliary functionality (like caching or a wrapping web service) being provided by a wrapping application implemented in Java. There is also a plan to provide an XQuery implementation. The SMC module is being maintained in the CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.
 
-\subsection{Query Language}
-CQL?
+
+\subsubsection{smc init}
+
+\subsubsection{smc browser}
+
+\subsubsection{smc as mdrepo module}
+
+\subsubsection{smc as VAS}
+
+
 
 \subsection{User Interface}