Changeset 2671


Timestamp:
03/10/13 21:06:32 (11 years ago)
Author:
vronk
Message:

mostly outsourcing individual chapters to separate tex-files

Location:
SMC4LRT
Files:
3 added
5 edited

  • SMC4LRT/Evaluation.tex

    r2669 r2671  
    11\section{Evaluation}
     2
     3\subsection{Use Cases}
     4
     5\begin{itemize}
     6
     7\item MD Search employing Semantic Mapping
     8\item MD Search employing Fuzzy Search
     9\item Visualization of the Results - ?
     10\end{itemize}
     11
     12A trivial example for a concept-based query expansion:
     13Confronted with a user query \texttt{Actor.Name = Sue}, and knowing that \texttt{Actor} is equivalent or similar to \texttt{Person} and that \texttt{Name} is a synonym of \texttt{FullName}, the expanded query could look like:
     14\texttt{Actor.Name = Sue OR Actor.FullName = Sue OR Person.Name = Sue OR Person.FullName = Sue}
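The expansion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `EQUIVALENTS` table and the `expand_query` function are hypothetical names, and the equivalences are hard-coded here, whereas in the proposed system they would be supplied by the semantic mapping.

```python
from itertools import product

# Hypothetical equivalence table (assumption): each path step maps to the
# list of steps considered equivalent or similar, including itself.
EQUIVALENTS = {
    "Actor": ["Actor", "Person"],
    "Name": ["Name", "FullName"],
}


def expand_query(index: str, value: str) -> str:
    """Expand a dotted search index into an OR of all equivalent indexes."""
    steps = index.split(".")
    # Steps without a known equivalence are kept unchanged.
    alternatives = [EQUIVALENTS.get(step, [step]) for step in steps]
    clauses = [
        f"{'.'.join(combination)} = {value}"
        for combination in product(*alternatives)
    ]
    return " OR ".join(clauses)


expanded = expand_query("Actor.Name", "Sue")
```

For the query above, `expanded` holds the four-clause disjunction given in the text; an index with no known equivalences is returned as a single unchanged clause.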
     15
     16Another example concerns instance mapping: a user looking for all resources produced by or linked to a given institution does not have to guess or care about the various spellings of the institution's name used in the resource descriptions, but can instead browse a controlled vocabulary of institutions and see all the resources of a given institution. While this could be achieved by simply normalizing the literal values (and indeed that definitely has to be one processing step), linking to an ontology also enables the user to continue browsing the ontology to find institutions that are related to the original institution by being concerned with similar topics, and to retrieve the union of resources for the resulting cluster. Thus, in general, the user is enabled to work with the data based on information that is not present in the original dataset.
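The normalization step mentioned above could look roughly like the following Python sketch. Everything in it is assumed for illustration: the vocabulary entry, the identifier `org:example-institute` and the functions `normalize` and `lookup` are hypothetical, and a real controlled vocabulary would be provided by an external service rather than a dictionary.

```python
import re

# Hypothetical controlled vocabulary (assumption): entity identifier mapped
# to the spellings observed in metadata records.
VOCABULARY = {
    "org:example-institute": [
        "Example Institute",
        "Example Inst.",
        "EXAMPLE INSTITUTE",
    ],
}


def normalize(literal: str) -> str:
    """Lower-case the literal, drop punctuation and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", literal.lower())
    return " ".join(cleaned.split())


# Index from normalized spelling to entity identifier.
INDEX = {
    normalize(spelling): entity
    for entity, spellings in VOCABULARY.items()
    for spelling in spellings
}


def lookup(literal: str):
    """Return the entity identifier for a literal, or None if unknown."""
    return INDEX.get(normalize(literal))
```

Once a literal is resolved to an entity identifier, all records sharing that identifier can be retrieved together, and the identifier can serve as the entry point for browsing related institutions in an ontology.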
     17
    218
    319\subsection{Research Questions }
  • SMC4LRT/Introduction.tex

    r2669 r2671  
    66
    77This work proposes a component that shall enhance search functionality over a \emph{large heterogeneous collection of metadata descriptions} of Language Resources and Technology (LRT). By applying semantic web technology the user shall be given both better recall through \emph{query expansion} based on related categories/concepts and new means of \emph{exploring the dataset} via ontology-driven browsing.
     8
     9Alternatively/  that allows query expansion by providing mappings between search indexes. This enables semantic search, ultimately increasing the recall when searching in metadata collections. The module builds on the Data Category Registry and Component Metadata Framework that are part of CMDI.
     10
    811
    912Following two examples for better illustration. First a concept-based query expansion:
     
    2730ones. These will then be used in the exercise of mapping the literal values in the by then RDF-converted metadata descriptions onto externally defined entities, with the goal of interlinking the dataset with external resources (see \textit{Linked Data} in \ref{SotA}).
    2831
    29 Finally in a prototypical implementation of the two components we want to deliver a proof of the concept, supported by an evaluation
     32Finally, in a prototypical implementation of the two components we want to deliver a proof of the concept, supported by an evaluation
    3033in which we apply a set of  test queries and compare a traditional search with a semantically expanded query in terms of recall/precision indicators. A separate evaluation of the usability of the Semantic Search component  is indicated, however this issue can only be tackled marginally and will have to be outsourced into future work.
    3134
     
    6063
    6164Ontology Visualization
    62 
    63 Federated Search, Distributed Content Search
    64 (ILS - Integrated Library Systems)
  • SMC4LRT/Literature.tex

    r2669 r2671  
    44
    55\subsection*{Infrastructure Components}
    6 There are multiple relevant activities being carried out in the context of research infrastructure initiatives for LRT. The most relevant ongoing effort is the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, being developed within the CLARIN project. This application operates on roughly the same collection of data as is discussed in this work, however it employs a faceted search, mapping manually the appropriate metadata fields from the different schemas to 8 fixed facets. Although this is a very reductionist approach it is certainly a great starting point offering a core set of categories together with an initial set of category mappings.
     6In recent years, multiple large-scale initiatives have been set out to combat the fragmented nature of the language resources landscape in general and the metadata interoperability problems in particular. A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\footnote{\url{http://www.clarin.eu/cmdi}} \cite{Broeder+2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders+2009,Broeder+2010}.
    77
    8 \texttt{Component Registry} and \texttt{ISOcat}\footnote{\url{http://www.isocat.org/}}
    9 are two integral components of the \textit{CLARIN Metadata Infrastructure} maintaining the normative information. Especially \texttt{ISOcat} -- the ISO-standardized Data Category Registry for registering and maintaining \texttt{Data Categories} as globally agreed upon incarnations of concepts in the domain of discourse -- is the definitive primary reference vocabulary \cite{Broeder2010,ISO12620:2009}. A tightly related work is that on the so called \texttt{Relation Registry}, a separate component that allows to define arbitrary relations between data categories, however this activity is rather in an early prototypical phase.
     8Individual components of this infrastructure are described in more detail in section \ref{components}.
    109
    11 And a last relevant intiative to mention is that of a \texttt{Vocabulary Alignment Service} being developed and run within the Dutch program CATCH\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}, which serves as a neutral manager and provider of controlled vocabularies. There are plans to reuse or enhance this service for the needs of the CLARIN project.
    12 
    13 \noindent
    14 All these components are running services, that this work shall directly build upon.
    1510
    1611\subsection*{LRT  Resources}
  • SMC4LRT/Outline.tex

    r2669 r2671  
    6161\usepackage[nottoc,notlof,notlot]{tocbibind} % Put the bibliography in the ToC
    6262%\usepackage[titles,subfigure]{tocloft} % Alter the style of the Table of Contents
    63 %\renewcommand{\cftsecfont}{\rmfamily\mdseries\upshape}
     63%\renewcommand{\cftsecfont}{\rmfamily\mdseries\upshape} 
    6464%\renewcommand{\cftsecpagefont}{\rmfamily\mdseries\upshape} % No bold!
    6565
     
    8989\begin{description}
    9090\item[Concept]  sense, idea, philosophical problem, which we don't need to discuss here. For our purposes we say: Basic "entity" in an ontology? that of what an ontology is build
    91 \item[Ontology]  "a explicit specification of a conceptualization" [cite!], but for us mainly a collection of concepts as opposed to lexicon, which is a collection of words.
     91\item[Ontology]  "an explicit specification of a conceptualization" [cite!], but for us mainly a collection of concepts as opposed to lexicon, which is a collection of words.
    9292\item[Word]  a lexical unit, a word in a language, something that has a surface Realization (writtenForm) and is a carrier of sense. so a Relation holds: hasSense(Word, Concept)
    9393\item[Lexicon]  a collection of words, a (lexical) vocabulary
     
    104104\end{description}
    105105
    106 \section{Analysis}
     106\include{Data}
    107107
    108 \subsection{Data landscape}
     108\include{Infrastructure}
    109109
    110 Describe situation regarding the datasets and formats
    111 
    112 collections, profiles/Terms, ResourceTypes!
    113 
    114 DC, OLAC,
    115 ISLE/IMDI, CHILDES, TEI, EAF!
    116 (CES/XCES)
    117 
    118 \subsection{Infrastructure}
    119 
    120 CMDI \cite{Broeder2010}
    121 
    122 
    123 \subsection{Ontologies, Controlled Vocabularies, Knowledge Organizing Systems}
    124 
    125 
    126 \subsubsection{Classification Schemes, Taxonomies }
    127 LCSH, DDC
    128 
    129 
    130 \subsubsection{Other controlled Vocabularies}
    131 Tagsets: STTS
    132 Language codes ISO-639-1
    133 
    134 \subsubsection{Domain Ontologies, Vocabularies}
    135 Organization-Lists
    136 LT-World !?
    137 
    138 
    139 \subsection{Use Cases}
    140 
    141 \begin{itemize}
    142 
    143 \item MD Search employing Semantic Mapping
    144 \item MD Search employing Fuzzy Search
    145 \item Content Search
    146 \item Combined MEtadata Content Search
    147 \item Visualization of the Results - charts on facets/dimensions
    148 
    149 \item  Create and publish Virtual Collection based on complex Search (intensional/extensional)
    150 \item  Let Create ad-hoc corpus
    151 \end{itemize}
    152 
    153 A trivial example for a concept-based query expansion:
    154 Confronted with a user query: \texttt{Actor.Name = Sue} and knowing that \texttt{Actor} is equivalent or similar to \texttt{Person} and \texttt{Name} is synonym to \texttt{FullName} the expanded query could look like:
    155 \texttt{Actor.Name = Sue OR Actor.FullName = Sue OR Person.Name =  Sue OR Person.FullName= is Sue}
    156 
    157 Another example concerning instance mapping: the user looking for all resource produced by or linked to a given institution, does not have to guess or care for various spellings of the name of the institution used in the description of the resources, but rather can browse through a controlled vocabulary of institutions and see all the resources of given institution. While this could be achieved by simple normalizing of the literal-values (and indeed that definitely has to be one processing step), the linking to an ontology, enables to user to also continue browsing the ontology to find institutions that are related to the original institution by means of being concerned with similar topics and retrieve a union of resources for such resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset.
    158 
    159 \section{Semantic Mapping}
    160 
    161 
    162 \subsection{Profiles to Data Categories}
    163 CMD:Profile.Comp.Elem -> DatCat
    164 
    165 
    166 \subsection{Semantic Relations between (Data)Categories}
    167 
    168 Relation Registry
    169 
    170 !check DCR-RR/Odijk2010 -follow up
    171 !Cf. Erhard Hinrichs 2009
    172 
    173 
    174 \subsection{Mapping from strings to Entities}
    175 
    176 Based on the textual values in the Metadata-descriptions find matching entities in selected Ontologies.
    177 
    178 Identify related ontologies:
    179 LT-World \cite{Joerg2010}
    180 
    181 task:
    182 \begin{enumerate}
    183 \item  express MDRecords in RDF
    184 \item  identify related ontologies/vocabularies (category -> vocabulary)
    185 \item  implement (reuse) a lookup/mapping function (Vocabulary Alignement Service? CATCH-PLUS?)
    186 
    187 \fbox{  function lookup: Category x String -> ConceptualDomain}
    188 
    189 Normally this would be served by dedicated controlled vocabularies, but expect also some string-normalizing preprocessing etc.
    190 \end{enumerate}
    191 
    192 
    193 
    194 \subsection{Semantic Search}
    195 
    196 Main purpose for the undertaking described in previous two chapters (mapping of concepts and entities) is to enhance the search capabilities of the MDService serving the Metadata/Resources-data. Namely to enhance it by employing ontological resources.
    197 Mainly this enhancement shall mean, that the user can access the data indirectly by browsing one or multiple  ontologies,
    198 with which the data will then be linked. These could be for example ontologies of Organizations and Projects.
    199 
    200 In this section we want to explore, how this shall be accomplished, ie how to bring the enhanced capabilities to the user.
    201 Crucial aspect is the question how to deal with the even greater amount of information in a user-friendly way, ie how to prevent overwhelming, intimidating or frustrating the user.
    202 
    203 Semi-transparently means, that primarily the semantic mapping shall integrate seamlessly in the interaction with the service, but it shall "explain" - offer enough information - on demand, for the user to understand its role and also being able manipulate easily.
    204 
    205 ?
    206 Facets
    207 Controlled Vocabularies
    208 Synonym Expansion (via TermExtraction(ContentSet))
    209 
    210 \subsection{Linked Data - Express dataset in RDF}
    211 
    212 Partly as by-product of the entities-mapping effort we will get the metadata-description rendered in RDF, linked with
    213 So theoretically we then only need to provide them "on the web", to make them a nucleus of the LinkedData-Cloud.
    214 
    215 Practically this won't be that straight-forward as the mapping to entities will be a hell of a work.
    216 But once that is solved, or for the subsets that it is solved, the publication of that data on the "SemanticWeb" should be easy.
    217 
    218 Technical aspects (RDF-store?) / interface (ontology browser?)
    219 
    220 defining the Mapping:
    221 \begin{enumerate}
    222 \item convert to RDF
    223 translate: MDREcord -> [\#mdrecord \#property literal]
    224 \item map: \#mdrecord \#property literal  -> [\#mdrecord \#property \#entity]
    225 \end{enumerate}
    226 
    227 \subsection{Content/Annotation}
    228 AF + DCR + RR
    229 
    230 
    231 \subsection{Visualization}
    232 Landscape, Treemap, SOM
    233 
    234 Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf
    235 
    236 
     110\include{SMC}
    237111
    238112\include{System}
     
    242116
    243117\section{Conclusions and Future Work}
     118
     119The Semantic Mapping module is based on the DCR and the CMD framework and is being developed as a separate service alongside the CLARIN Metadata Service, its primary consuming service, but it shall be equally usable by other applications.
     120
     121Further work is needed on more complex types of response (similarity ratio, relation types) and also on the interaction with Metadata Service to find the optimal way of providing the features of semantic mapping and query expansion as semantic search within the search user-interface.
     122
    244123
    245124\section{Questions, Remarks}
     
    252131
    253132
    254 \bibliographystyle{ieee}
    255 \bibliography{../../../2bib/lingua,../../../2bib/ontolingua}
     133\bibliographystyle{ieeetr}
     134\bibliography{../../2bib/lingua,../../2bib/ontolingua,../../2bib/smc4lrt,../../2bib/semweb}
    256135
    257136
  • SMC4LRT/System.tex

    r2669 r2671  
    1 \section{System Design}
    2 SOA
     1\section{?? System}
     2SOA?
     3
    34
    45\subsection{DataModel}
     
    910RDF
    1011
    11 \subsection{Architecture}
    12 
    13 Makes use of mulitple Components of the established infrastructure (CLARIN ) \cite{Varadi2008}, \cite{Broeder2010}:
    14 
    15 \begin{itemize}
    16 \item Data Category REgistry,
    17 \item Relation Registry
    18 \item Component Registry
    19 \item Vocabulary Alignement Service (OpenSKOS)
    20 \item SchemaParser
    21 \end{itemize}
    22 merging the pieces of information provided by those,
    23 offering them semi-transaprently to the user (or application) on the consumption side.
     12\subsection{Query Language}
     13CQL?
    2414
    2515
    26 \subsection{CMDI}
     16\subsection*{Implementation}
    2717
    28 MDBrowser
    29 MDService
     18The core function of the SMC is being implemented as a set of XSL-stylesheets, with auxiliary functionality (like caching or a wrapping web service) being provided by a wrapping application implemented in Java. There is also a plan to provide an XQuery implementation. The SMC module is being maintained in the CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.
    3019
    31 \subsection{Query Language}
    32 CQL?
     20
     21\subsubsection{smc init}
     22
     23\subsubsection{smc browser}
     24
     25\subsubsection{smc as mdrepo module}
     26
     27\subsubsection{smc as VAS}
     28
     29
    3330
    3431\subsection{User Interface}