Changeset 2671

Timestamp:

03/10/13 21:06:32 (11 years ago)

Author:

vronk

Message:

mostly outsourcing individual chapters to separate tex-files

Location:

SMC4LRT

Files:

: 3 added
: 5 edited

Data.tex (added)
Evaluation.tex (modified) (1 diff)
Infrastructure.tex (added)
Introduction.tex (modified) (3 diffs)
Literature.tex (modified) (1 diff)
Outline.tex (modified) (5 diffs)
SMC.tex (added)
System.tex (modified) (2 diffs)

SMC4LRT/Evaluation.tex

-                      r2669
+                      r2671
 \section{Evaluation}
+\subsection{Use Cases}
+\begin{itemize}
+\item MD Search employing Semantic Mapping
+\item MD Search employing Fuzzy Search
+\item Visualization of the Results - ?
+\end{itemize}
+A trivial example of a concept-based query expansion:
+Confronted with a user query \texttt{Actor.Name = Sue}, and knowing that \texttt{Actor} is equivalent or similar to \texttt{Person} and that \texttt{Name} is a synonym of \texttt{FullName}, the expanded query could look like:
+\texttt{Actor.Name = Sue OR Actor.FullName = Sue OR Person.Name = Sue OR Person.FullName = Sue}
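The expansion described in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `EQUIVALENTS` table and the `expand_query` function are hypothetical names, and in the proposed system the equivalences would presumably come from the Data Category Registry and Relation Registry rather than from a hard-coded dictionary.

```python
from itertools import product

# Hypothetical equivalence table (assumption): each term maps to itself plus
# the terms considered equivalent or synonymous in the example.
EQUIVALENTS = {
    "Actor": ["Actor", "Person"],
    "Name": ["Name", "FullName"],
}


def expand_query(index: str, value: str) -> str:
    """Expand a dotted index such as 'Actor.Name' into an OR-joined query."""
    parts = index.split(".")
    # Terms without a known equivalent are kept unchanged.
    alternatives = [EQUIVALENTS.get(part, [part]) for part in parts]
    clauses = [
        f"{'.'.join(combination)} = {value}"
        for combination in product(*alternatives)
    ]
    return " OR ".join(clauses)


print(expand_query("Actor.Name", "Sue"))
```

Each path segment is expanded independently and the alternatives are combined as a Cartesian product, which reproduces the four clauses of the example query.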
+Another example concerning instance mapping: a user looking for all resources produced by or linked to a given institution does not have to guess or care about the various spellings of the institution's name used in the resource descriptions, but can instead browse a controlled vocabulary of institutions and see all the resources of a given institution. While this could be achieved by simply normalizing the literal values (and indeed that definitely has to be one processing step), linking to an ontology also enables the user to continue browsing the ontology to find institutions that are related to the original institution by being concerned with similar topics, and to retrieve the union of resources for the resulting cluster. Thus, in general, the user is enabled to work with the data based on information that is not present in the original dataset.
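The "union of resources for a cluster of related institutions" can be illustrated with a small Python sketch. All identifiers and data below are invented for illustration; the relation "concerned with similar topics" is assumed to be available as a simple lookup table, whereas in the proposed system it would come from an ontology.

```python
# Hypothetical data (assumption): resources already linked to institution
# entities, and an ontology-derived table of related institutions.
RESOURCES_BY_INSTITUTION = {
    "inst:A": {"res1", "res2"},
    "inst:B": {"res3"},
    "inst:C": {"res4"},
}
RELATED_INSTITUTIONS = {
    "inst:A": {"inst:B"},
}


def resources_for_cluster(institution: str) -> set:
    """Return the union of resources of an institution and its related ones."""
    cluster = {institution} | RELATED_INSTITUTIONS.get(institution, set())
    resources = set()
    for member in cluster:
        resources |= RESOURCES_BY_INSTITUTION.get(member, set())
    return resources


print(sorted(resources_for_cluster("inst:A")))
```

The resource `res3` is returned for `inst:A` although no metadata record mentions `inst:A` for it; this is the sense in which the user works with information not present in the original dataset.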
 \subsection{Research Questions }

SMC4LRT/Introduction.tex

-                      r2669
+                      r2671
 This work proposes a component that shall enhance search functionality over a \emph{large heterogeneous collection of metadata descriptions} of Language Resources and Technology (LRT). By applying semantic web technology the user shall be given both better recall through \emph{query expansion} based on related categories/concepts and new means of \emph{exploring the dataset} via ontology-driven browsing.
+Alternatively/  that allows query expansion by providing mappings between search indexes. This enables semantic search, ultimately increasing the recall when searching in metadata collections. The module builds on the Data Category Registry and Component Metadata Framework that are part of CMDI.
 Following two examples for better illustration. First a concept-based query expansion:
 …
 ones. These will then be used in the exercise of mapping the literal values in the by then RDF-converted metadata descriptions onto externally defined entities, with the goal of interlinking the dataset with external resources (see \textit{Linked Data} in \ref{SotA}).
 Finally in a prototypical implementation of the two components we want to deliver a proof of the concept, supported by an evaluation
+Finally, in a prototypical implementation of the two components we want to deliver a proof of the concept, supported by an evaluation
 in which we apply a set of  test queries and compare a traditional search with a semantically expanded query in terms of recall/precision indicators. A separate evaluation of the usability of the Semantic Search component  is indicated, however this issue can only be tackled marginally and will have to be outsourced into future work.
 …
 Ontology Visualization
-Federated Search, Distributed Content Search
-(ILS - Integrated Library Systems)

SMC4LRT/Literature.tex

-                      r2669
+                      r2671
 \subsection*{Infrastructure Components}
 There are multiple relevant activities being carried out in the context of research infrastructure initiatives for LRT. The most relevant ongoing effort is the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, being developed within the CLARIN project. This application operates on roughly the same collection of data as is discussed in this work, however it employs a faceted search, mapping manually the appropriate metadata fields from the different schemas to 8 fixed facets. Although this is a very reductionist approach it is certainly a great starting point offering a core set of categories together with an initial set of category mappings.
+In recent years, multiple large-scale initiatives have been set out to combat the fragmented nature of the language resources landscape in general and the metadata interoperability problems in particular. A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\footnote{\url{http://www.clarin.eu/cmdi}} \cite{Broeder+2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders+2009,Broeder+2010}.
+\texttt{Component Registry} and \texttt{ISOcat}\footnote{\url{http://www.isocat.org/}}
+are two integral components of the \textit{CLARIN Metadata Infrastructure} maintaining the normative information. In particular, \texttt{ISOcat} -- the ISO-standardized Data Category Registry for registering and maintaining \texttt{Data Categories} as globally agreed upon incarnations of concepts in the domain of discourse -- is the definitive primary reference vocabulary \cite{Broeder2010,ISO12620:2009}. A closely related effort is the so-called \texttt{Relation Registry}, a separate component that allows arbitrary relations between data categories to be defined; however, this activity is still in an early prototypical phase.
+Individual components of this infrastructure will be described in more detail in the section \ref{components}.
-And a last relevant intiative to mention is that of a \texttt{Vocabulary Alignment Service} being developed and run within the Dutch program CATCH\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}, which serves as a neutral manager and provider of controlled vocabularies. There are plans to reuse or enhance this service for the needs of the CLARIN project.
-\noindent
-All these components are running services, that this work shall directly build upon.
 \subsection*{LRT  Resources}

SMC4LRT/Outline.tex

-                      r2669
+                      r2671
 \usepackage[nottoc,notlof,notlot]{tocbibind} % Put the bibliography in the ToC
 %\usepackage[titles,subfigure]{tocloft} % Alter the style of the Table of Contents
 %\renewcommand{\cftsecfont}{\rmfamily\mdseries\upshape}
+%\renewcommand{\cftsecfont}{\rmfamily\mdseries\upshape}
 %\renewcommand{\cftsecpagefont}{\rmfamily\mdseries\upshape} % No bold!
 …
 \begin{description}
 \item[Concept]  sense, idea, philosophical problem, which we don't need to discuss here. For our purposes we say: Basic "entity" in an ontology? that of what an ontology is build
 \item[Ontology]  "a explicit specification of a conceptualization" [cite!], but for us mainly a collection of concepts as opposed to lexicon, which is a collection of words.
+\item[Ontology]  "an explicit specification of a conceptualization" [cite!], but for us mainly a collection of concepts as opposed to lexicon, which is a collection of words.
 \item[Word]  a lexical unit, a word in a language, something that has a surface Realization (writtenForm) and is a carrier of sense. so a Relation holds: hasSense(Word, Concept)
 \item[Lexicon]  a collection of words, a (lexical) vocabulary
 …
 \end{description}
 \section{Analysis}
+\include{Data}
 \subsection{Data landscape}
+\include{Infrastructure}
+Describe situation regarding the datasets and formats
+collections, profiles/Terms, ResourceTypes!
+DC, OLAC,
+ISLE/IMDI, CHILDES, TEI, EAF!
+(CES/XCES)
+\subsection{Infrastructure}
+CMDI \cite{Broeder2010}
+\subsection{Ontologies, Controlled Vocabularies, Knowledge Organizing Systems}
+\subsubsection{Classification Schemes, Taxonomies }
+LCSH, DDC
+\subsubsection{Other controlled Vocabularies}
+Tagsets: STTS
+Language codes ISO-639-1
+\subsubsection{Domain Ontologies, Vocabularies}
+Organization-Lists
+LT-World !?
+\subsection{Use Cases}
+\begin{itemize}
+\item MD Search employing Semantic Mapping
+\item MD Search employing Fuzzy Search
+\item Content Search
+\item Combined Metadata Content Search
+\item Visualization of the Results - charts on facets/dimensions
+\item  Create and publish Virtual Collection based on complex Search (intensional/extensional)
+\item  Let the user create an ad-hoc corpus
+\end{itemize}
+A trivial example of a concept-based query expansion:
+Confronted with a user query \texttt{Actor.Name = Sue}, and knowing that \texttt{Actor} is equivalent or similar to \texttt{Person} and that \texttt{Name} is a synonym of \texttt{FullName}, the expanded query could look like:
+\texttt{Actor.Name = Sue OR Actor.FullName = Sue OR Person.Name = Sue OR Person.FullName = Sue}
+Another example concerning instance mapping: a user looking for all resources produced by or linked to a given institution does not have to guess or care about the various spellings of the institution's name used in the resource descriptions, but can instead browse a controlled vocabulary of institutions and see all the resources of a given institution. While this could be achieved by simply normalizing the literal values (and indeed that definitely has to be one processing step), linking to an ontology also enables the user to continue browsing the ontology to find institutions that are related to the original institution by being concerned with similar topics, and to retrieve the union of resources for the resulting cluster. Thus, in general, the user is enabled to work with the data based on information that is not present in the original dataset.
+\section{Semantic Mapping}
+\subsection{Profiles to Data Categories}
+CMD:Profile.Comp.Elem -> DatCat
+\subsection{Semantic Relations between (Data)Categories}
+Relation Registry
+!check DCR-RR/Odijk2010 -follow up
+!Cf. Erhard Hinrichs 2009
+\subsection{Mapping from strings to Entities}
+Based on the textual values in the Metadata-descriptions find matching entities in selected Ontologies.
+Identify related ontologies:
+LT-World \cite{Joerg2010}
+task:
+\begin{enumerate}
+\item  express MDRecords in RDF
+\item  identify related ontologies/vocabularies (category -> vocabulary)
+\item  implement (reuse) a lookup/mapping function (Vocabulary Alignment Service? CATCH-PLUS?)
+\fbox{  function lookup: Category x String -> ConceptualDomain}
+Normally this would be served by dedicated controlled vocabularies, but expect also some string-normalizing preprocessing etc.
+\end{enumerate}
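The boxed `lookup` signature in the list above, together with the string-normalizing preprocessing it mentions, could be sketched as follows. This is a minimal illustration, not the planned implementation: the vocabulary contents, the entity identifier, and the choice of normalization steps are all assumptions, and the sketch returns a single entity identifier rather than a full conceptual domain.

```python
import unicodedata
from typing import Optional


def normalize(value: str) -> str:
    """Fold case, strip diacritics and collapse whitespace in a literal."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_marks.casefold().split())


# Hypothetical controlled vocabularies (assumption), keyed by category, then
# by the normalized label of each entry.
VOCABULARIES = {
    "Organisation": {
        normalize("Université Exemple"): "org:universite-exemple",
    },
}


def lookup(category: str, value: str) -> Optional[str]:
    """Category x String -> entity identifier, or None if nothing matches."""
    vocabulary = VOCABULARIES.get(category, {})
    return vocabulary.get(normalize(value))


print(lookup("Organisation", "  UNIVERSITE   exemple "))
```

Selecting the vocabulary by category first corresponds to step 2 (category -> vocabulary); an exact match on normalized labels is the simplest possible strategy, and a dedicated alignment service would be expected to do considerably more.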
+\subsection{Semantic Search}
+The main purpose of the undertaking described in the previous two chapters (mapping of concepts and entities) is to enhance the search capabilities of the MDService serving the Metadata/Resources-data, namely by employing ontological resources.
+This enhancement shall mainly mean that the user can access the data indirectly by browsing one or multiple ontologies,
+with which the data will then be linked. These could be, for example, ontologies of organizations and projects.
+In this section we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
+A crucial aspect is the question of how to deal with the even greater amount of information in a user-friendly way, i.e. how to avoid overwhelming, intimidating or frustrating the user.
+Semi-transparently means that the semantic mapping shall primarily integrate seamlessly into the interaction with the service, but shall "explain" itself on demand, i.e. offer enough information for the user to understand its role and to be able to manipulate it easily.
+?
+Facets
+Controlled Vocabularies
+Synonym Expansion (via TermExtraction(ContentSet))
+\subsection{Linked Data - Express dataset in RDF}
+Partly as by-product of the entities-mapping effort we will get the metadata-description rendered in RDF, linked with
+So theoretically we then only need to provide them "on the web", to make them a nucleus of the LinkedData-Cloud.
+In practice this will not be that straightforward, as the mapping to entities will require considerable effort.
+But once that is solved, or for the subsets that it is solved, the publication of that data on the "SemanticWeb" should be easy.
+Technical aspects (RDF-store?) / interface (ontology browser?)
+defining the Mapping:
+\begin{enumerate}
+\item convert to RDF
+translate: MDRecord -> [\#mdrecord \#property literal]
+\item map: \#mdrecord \#property literal  -> [\#mdrecord \#property \#entity]
+\end{enumerate}
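The two mapping steps listed above can be sketched with plain tuples standing in for RDF triples. This is an illustration only: the record structure, the property names and the `ENTITY_INDEX` table are assumptions, and a real implementation would presumably use an RDF library and proper URIs.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical index (assumption): (property, literal) pairs that are known
# to denote an entity in some external vocabulary.
ENTITY_INDEX = {
    ("#organisation", "Example Institute"): "#entity/example-institute",
}


def to_triples(record_id: str, record: Dict[str, str]) -> List[Triple]:
    """Step 1: MDRecord -> [#mdrecord #property literal]."""
    return [(record_id, prop, literal) for prop, literal in record.items()]


def map_entities(triples: List[Triple]) -> List[Triple]:
    """Step 2: replace a literal by an entity where a mapping is known."""
    mapped = []
    for subject, prop, literal in triples:
        entity = ENTITY_INDEX.get((prop, literal))
        # Unmapped literals are kept as they are, so no information is lost.
        mapped.append((subject, prop, entity if entity is not None else literal))
    return mapped


record = {"#organisation": "Example Institute", "#title": "Sample corpus"}
for triple in map_entities(to_triples("#md1", record)):
    print(triple)
```

Keeping unmapped literals in place means the second step can be applied incrementally, for exactly those subsets of the data where the entity mapping is already solved.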
+\subsection{Content/Annotation}
+AF + DCR + RR
+\subsection{Visualization}
+Landscape, Treemap, SOM
+Ontology Mapping and Alignment / saiks/Ontology4 4auf1.pdf
+\include{SMC}
 \include{System}
 …
 \section{Conclusions and Future Work}
+The Semantic Mapping module is based on the DCR and CMD framework and is being developed as a separate service on the side of CLARIN  Metadata Service, its primary consuming service, but shall be equally usable by other applications.
+Further work is needed on more complex types of response (similarity ratio, relation types) and also on the interaction with Metadata Service to find the optimal way of providing the features of semantic mapping and query expansion as semantic search within the search user-interface.
 \section{Questions, Remarks}
 …
 \bibliographystyle{ieee}
 \bibliography{../../../2bib/lingua,../../../2bib/ontolingua}
+\bibliographystyle{ieeetr}
+\bibliography{../../2bib/lingua,../../2bib/ontolingua,../../2bib/smc4lrt,../../2bib/semweb}

SMC4LRT/System.tex

-                      r2669
+                      r2671
+\section{System Design}
+SOA
+\section{?? System}
+SOA?
 \subsection{DataModel}
 …
 RDF
+\subsection{Architecture}
+Makes use of multiple components of the established infrastructure (CLARIN) \cite{Varadi2008}, \cite{Broeder2010}:
+\begin{itemize}
+\item Data Category Registry,
+\item Relation Registry
+\item Component Registry
+\item Vocabulary Alignment Service (OpenSKOS)
+\item SchemaParser
+\end{itemize}
+merging the pieces of information provided by those,
+offering them semi-transparently to the user (or application) on the consumption side.
+\subsection{Query Language}
+CQL?
 \subsection{CMDI}
+\subsection*{Implementation}
+MDBrowser
+MDService
+The core function of the SMC is being implemented as a set of XSL-stylesheets, with auxiliary functionality (like caching or a wrapping web service) being provided by a wrapping application implemented in Java. There is also a plan to provide an XQuery implementation. The SMC module is being maintained in the CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.
+\subsection{Query Language}
+CQL?
+\subsubsection{smc init}
+\subsubsection{smc browser}
+\subsubsection{smc as mdrepo module}
+\subsubsection{smc as VAS}
 \subsection{User Interface}
