Changeset 3680 for SMC4LRT


Timestamp:
10/04/13 22:47:37 (11 years ago)
Author:
vronk
Message:

adding Schema Matching info and application

Location:
SMC4LRT
Files:
9 edited

Legend:

Unmodified
Added
Removed
  • SMC4LRT/Outline.tex

    r3671 r3680  
    8181
    8282\input{chapters/Introduction}
    83 \end{comment}
     83
    8484\input{chapters/Literature}
    8585
    8686\input{chapters/Definitions}
    87 
     87\end{comment}
    8888\input{chapters/Data}
    8989
     
    9999
    100100\input{chapters/Conclusion}
     101
    101102\end{comment}
    102103
     
    104105
    105106\bibliographystyle{ieeetr}
    106 \bibliography{../../2bib/lingua,../../2bib/ontolingua,../../2bib/smc4lrt,../../2bib/semweb,../../2bib/distributed_systems,../../2bib/own, ../../2bib/diglib,../../2bib/it-misc}
     107\bibliography{../../2bib/lingua,../../2bib/ontolingua,../../2bib/smc4lrt,../../2bib/semweb,../../2bib/distributed_systems,../../2bib/own,../../2bib/diglib,../../2bib/it-misc}
    107108
    108109\appendix
  • SMC4LRT/chapters/Data.tex

    r3671 r3680  
    132132
    133133\section{Other Metadata Formats and Collections }
     134\label{sec:lrt-md-catalogs}
    134135
    135136
     
    139140
    140141
    141 \subsection{Dublin Core metadata terms + OLAC}
    142 Since 1995
    143 Maintained Dublin Core Metadata Initiative
    144 DC, OLAC
    145 
    146 "Dublin" refers to Dublin, Ohio, USA where the work originated during the 1995 invitational OCLC/NCSA Metadata Workshop,[8] hosted by the Online Computer Library Center (OCLC), a library consortium based in Dublin, and the National Center for Supercomputing Applications (NCSA).
    147 
    148 comes in two version: 15 core elements  and 55 qualified terms ?
    149 
    150 \begin{quotation}
    151 Early Dublin Core workshops popularized the idea of "core metadata" for simple and generic resource descriptions. The fifteen-element "Dublin Core" achieved wide dissemination as part of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and has been ratified as IETF RFC 5013, ANSI/NISO Standard Z39.85-2007, and ISO Standard 15836:2009.
    152 \end{quotation}
    153 
    154 
    155 
    156 Given its simplicity it is used as the common denominator in many applications, among others it is the base format in the OAI-PMH protocol.
    157 
    158 It is required/expected as the base
    159 openarchives register: \url{http://www.openarchives.org/Register/BrowseSites}
    160  2006 OAI-repositories
    161 
    162 DublinCore Resource Types\furl{http://dublincore.org/documents/resource-typelist/}
    163 
    164 DublinCore to RDF mapping\furl{http://dublincore.org/documents/dcq-rdf-xml/}
    165 
     142\subsection{Dublin Core metadata terms}
     143The work on this metadata format started in 1995 at the Metadata Workshop\furl{http://dublincore.org/workshops/dc1/} organized by OCLC/NCSA in Dublin, Ohio, USA. Nowadays it is maintained by the Dublin Core Metadata Initiative.
     144
     145It is a fixed set of terms for a basic generic description of a range of resources (both virtual and physical), coming in two versions\furl{http://dublincore.org/documents/dcmi-terms/}:
     146\begin{description}
     147\item[Dublin Core Metadata Element Set (DCMES) ] \code{/elements/1.1/}
     148the original set of 15 terms, standardized as IETF RFC 5013, ISO Standard 15836-2009 and NISO Standard Z39.85-2007
     149\item[Dublin Core metadata terms ] \code{/terms/}
     150the extended `Qualified' set of 55 terms, extending the original 15 (replicating them in the new namespace for consistency)
     151\end{description}
     152
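For illustration, a minimal record using the original 15-element set, in the XML serialization employed by OAI-PMH, could look as follows (an illustrative sketch, mirroring the OLAC sample in listing \ref{lst:sampleolac}):

\lstset{language=XML}
\begin{lstlisting}[label=lst:sampledc, caption=Sample Dublin Core record (illustrative)]
<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>Bloomfield, Leonard</dc:creator>
  <dc:date>1933</dc:date>
  <dc:title>Language</dc:title>
  <dc:publisher>New York: Holt</dc:publisher>
</oai_dc:dc>
\end{lstlisting}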
     153Today, the Dublin Core metadata terms are very widely used. Thanks to their simplicity they serve as the common denominator in many applications: content management systems integrate Dublin Core in the \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), and it is the default minimal description in content repositories (Fedora Commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
     154
     155There are multiple possible serializations; in particular, a mapping to RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
     156Worth noting is Dublin Core's take on classification of resources\furl{http://dublincore.org/documents/resource-typelist/}.
     157
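A flavour of this RDF mapping (an illustrative sketch; the resource URI is hypothetical):

\lstset{language=XML}
\begin{lstlisting}[label=lst:sampledcrdf, caption=Dublin Core description in RDF/XML (illustrative)]
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/resource/1">
    <dc:creator>Bloomfield, Leonard</dc:creator>
    <dc:title>Language</dc:title>
  </rdf:Description>
</rdf:RDF>
\end{lstlisting}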
     158The simplicity of the format is also its main drawback when considered as a metadata format for the research communities: it is too general to capture all the specific details with which individual research groups need to describe their different kinds of resources.
     159
     160\subsection{OLAC}
    166161\label{def:OLAC}
    167162
    168 \xne{OLAC Metadata}\furl{http://www.language-archives.org/}format\cite{Bird2001},OLAC \cite{Simons2003OLAC} is a more specialized version of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community:
     163\xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001} is an application profile \cite{heery2000application} of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community}, providing a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.
     164
     165The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies for domain-specific semantic annotation (\code{linguistic-field, linguistic-type, language, role, discourse-type}).
    169166
    170167\begin{quotation}
     
    172169\end{quotation}
    173170
    174 The \xne{OLAC Metadata} is the set of metadata elements archives participating in have agreed to use for describing language resources.
    175 
    176 \todoin{check http://www.language-archives.org/OLAC/metadata.html}
    177 
    178  OLAC Archives contain over 100,000 records, covering resources in half of the world's living languages. More statistics on coverage.
    179 http://www.language-archives.org/
    180 
    181 Most of the OLAC records are integrated into CMDI (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC})
    182 
     171\lstset{language=XML}
     172\begin{lstlisting}[label=lst:sampleolac, caption=Sample OLAC record]
     173<olac:olac>
     174   <creator>Bloomfield, Leonard</creator>
     175   <date>1933</date>
     176   <title>Language</title>
     177   <publisher>New York: Holt</publisher>
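   <!-- illustrative addition (not part of the original sample):
        domain-specific annotation via OLAC's controlled-vocabulary attributes -->
   <subject xsi:type="olac:linguistic-field" olac:code="semantics"/>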
     178</olac:olac>
     179\end{lstlisting}
     180
     181OLAC provides a ``search over 100,000 records collected from 44 archives\furl{http://www.language-archives.org/archives}, covering resources in half of the world's living languages''.
     182
     183Note that the OLAC archives are being harvested by the CLARIN harvester and the OLAC records are part of the CMDI joint metadata domain (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC}).
    183184
    184185\subsection{TEI / teiHeader}
     
    186187
    187188\begin{quotation}
    188 The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. 
     189The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form.\furl{http://www.tei-c.org/} \dots Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics \dots [In addition,] the Consortium provides a variety of TEI-related resources, training events and software. [abridged]
    189190\end{quotation}
    190 
    191 \url{http://www.tei-c.org/}
    192 
    193 TEI is a de-facto standard for encoding any kind of digital textual resources being developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description, metadata encoding the complex top-level element \code{teiHeader} is foreseen. TEI is not prescriptive, but rather descriptive, it does not provide just one fixed schema, but allows for a certain flexibility wrt to elements used and inner structure, allowing to generate custom schemas adopted to projects' needs.
    194 
    195 Thus there is also not just one fixed \xne{teiHeader}.
    196 
    197  TEI/teiHeader/ODD,
    198 
    199 
     196
     197TEI is a de-facto standard for encoding any kind of digital textual resources, being developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description, i.e. metadata encoding (of main concern for us), the complex top-level element \code{teiHeader} is foreseen. TEI is not prescriptive, but rather descriptive: it does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and the inner structure, allowing to generate custom schemas adapted to projects' needs. Thus there is also not just one fixed \code{teiHeader}.
     198
     199Some of the data collections encoded in TEI are the corpora of the DWDS\furl{http://www.dwds.de}, the Deutsches Textarchiv\furl{http://www.dwds.de/dta} \cite{Geyken2011deutsches} and the Oxford Text Archive\furl{http://ota.oucs.ox.ac.uk/}.
     200
     201There has been intense cooperation between the TEI and CMDI communities on the issue of interoperability, and multiple efforts to express the teiHeader in CMDI were undertaken (cf. \ref{results:tei}) as a starting point for integrating TEI-based data into the CLARIN infrastructure.
    200202
    201203\subsection{ISLE/IMDI}
     
    204206http://www.mpi.nl/imdi/
    205207
     208\begin{quotation}
    206209The ISLE Meta Data Initiative (IMDI) is a proposed metadata standard to describe multi-media and multi-modal language resources. The standard provides interoperability for browsable and searchable corpus structures and resource descriptions with help of specific tools.
     210\end{quotation}
     211
     212
     213\subsection{LAT, TLA}
     214Language Archiving Technology, now The Language Archive -- provided by the Max Planck Institute for Psycholinguistics\footnote{\url{http://www.mpi.nl/research/research-projects/language-archiving-technology}}
     215
    207216
    208217Predecessor of CMDI
     
    213222
    214223Metadata Object Description Schema (MODS) is a schema for a bibliographic element set that may be used for a variety of purposes, particularly for library applications.
     224
     225
    215226
    216227\subsection{ESE, Europeana Data Model - EDM}
     
    245256
    246257
     258
     259\subsection{META-NET}
     260
     261
     262
     263\begin{quotation}
     264META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.
     265
     266META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.).
     267
     268\end{quotation}
     269
     270The distributed network of repositories consists of a number of member repositories, each offering its own subset of resources.
     271
     272A few\footnote{7 as of 2013-07} of the member repositories play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
     273The whole network offers approximately 2,000 resources (the numbers differ even across individual managing nodes).
     274
     275
     276MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
     277
     278\subsection{ELRA}
     279
     280European Language Resources Association\furl{http://elra.info}
     281
     282\begin{quotation}
     283ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies.
     284\end{quotation}
     285
     286ELDA -- Evaluations and Language resources Distribution Agency\furl{http://www.elda.org/} -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community. Besides, ELDA is involved in HLT evaluation campaigns.
     290
     291ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA.
     292
     293ELRA maintains two catalogs: the \xne{ELRA Catalogue}\furl{http://catalog.elra.info/} of resources distributed by ELRA, and the \xne{Universal Catalogue}, a repository comprising information regarding Language Resources (LRs) identified all over the world.
     300
     301
    247302\subsection{Other}
    248303
     
    262317\item Persons - GND, VIAF
    263318\item Organizations - GND, VIAF
    264 \item Schlagwörter/Subjects - GND, LCSH
     319\item Schlagw\"{o}rter/Subjects - GND, LCSH
    265320\item Resource Typology -
    266321\end{itemize}
     
    274329Other related relevant activities and initiatives
    275330
     331See also the W3C WebSchemas wiki on controlled property values\furl{http://www.w3.org/wiki/WebSchemas/ExternalEnumerations#Controlled_property_values}.
     332
    276333A broader collection of related initiatives can be found at the German National Library website:
    277334\furl{http://www.dnb.de/DE/Standardisierung/LinksAFS/linksafs_node.html}
     
    281338http://metadaten-twr.org/ - Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011)
    282339At MPDL, within the escidoc publication platform there seems to be (work  on) a service (since 2009 !) for controlled vocabularies: \furl{http://colab.mpdl.mpg.de/mediawiki/Control_of_Named_Entities}
    283 Entity Authority Tool Set - a web application for recording, editing, using and displaying authority information about entities – developed at the New Zealand Electronic Text Centre (NZETC).
     340Entity Authority Tool Set - a web application for recording, editing, using and displaying authority information about entities -- developed at the New Zealand Electronic Text Centre (NZETC).
    284341http://eats.readthedocs.org/en/latest/
    285342
     
    303360
    304361\subsubsection{LT-World}
    305 Regarding existing domain-specific semantic resources \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}},  the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{\textit{Deutsches Forschungszentrum fÃŒr KÃŒnstliche Intelligenz} - \url{http://www.dfki.de}},  is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study. \cite{Joerg2010}
    306 
    307 
    308 
    309 \section{LRT Metadata Catalogs/Collections}
    310 \label{sec:lrt-md-catalogs}
    311 \todoin{Overview of catalogs, name, since, \#providers, \#resources}
    312 
    313 \todoin{[DFKI/LT-World]  - collection or ontology}
    314 
    315 \subsection{CMDI}
    316 collections, profiles/Terms, ResourceTypes!
    317 
    318 \subsection{OLAC}
    319 
    320 \subsection{LAT, TLA}
    321 Language Archiving Technology, now The Language Archive - provided by Max Planck Insitute for Psycholinguistics \footnote{\url{http://www.mpi.nl/research/research-projects/language-archiving-technology}}
    322 
    323 \subsection{META-NET}
    324 
    325 
    326 
    327 \begin{quotation}
    328 META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.
    329 
    330 META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.).
    331 
    332 \end{quotation}
    333 
    334 The distributed networks of repositories consists of a number of member repositories, that offer their own subset of resource.
    335 
    336 A few\footnote{7 as of 2013-07} of the members repositories play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
    337 The whole network offers approximately 2.000 resources (the numbers differ even across individual managing nodes).
    338 
    339 
    340 MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
    341 
    342 
    343 
    344 \subsection{ELRA}
    345 
    346 European Language Resources Association
    347 
    348 \furl{http://elra.info}
    349 
    350 
    351 ELRA’s missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. To achieve these two major missions, we offer a range of services, listed below and described in the "Services around Language Resources" section:
    352 
    353 
    354 http://www.elda.org/
    355 Evaluations and Language resources Distribution Agency
    356 
    357 ELDA - Evaluations and Language resources Distribution Agency – is ELRA’s operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT – Human Language Technology – community. Besides, ELDA is involved in HLT evaluation campaigns.
    358 
    359 ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA.
    360 
    361 ELRA Catalog
    362 
    363 http://catalog.elra.info/
    364 
    365 
    366 Universal Catalog+
    367  Universal Catalogue is a repository comprising information regarding Language Resources (LRs) identified all over the world.
    368 
    369 
    370 \subsection{Other}
     362Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum f\"{u}r K\"{u}nstliche Intelligenz, \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study \cite{Joerg2010}.
     363
     364
    371365
    372366
     
    394388
    395389
     390VoID (``Vocabulary of Interlinked Datasets'') is an RDF-based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID}.
     391
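A minimal VoID description of a dataset could look as follows (an illustrative sketch; all URIs are hypothetical):

\lstset{language=XML}
\begin{lstlisting}[label=lst:samplevoid, caption=Sample VoID dataset description (illustrative)]
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <void:Dataset rdf:about="http://example.org/dataset/cmd">
    <dcterms:title>CMD metadata as linked data</dcterms:title>
    <void:sparqlEndpoint rdf:resource="http://example.org/sparql"/>
  </void:Dataset>
</rdf:RDF>
\end{lstlisting}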
     392The German National Library also publishes its data as RDF\furl{http://www.dnb.de/rdf}.
     393
     394
     395OCLC began adding linked data to WorldCat by appending Schema.org descriptive mark-up to WorldCat.org pages, thereby making the entire WorldCat cataloging collection publicly available -- with library extensions -- for use by developers, intelligent web crawlers and search partners such as Bing, Google, Yahoo! and Yandex.
     403
     404
     405\subsection{schema.org}
     406
     407The schema.org vocabulary\furl{http://schema.org/docs/datamodel.html} can be embedded into web pages either as microdata or as RDFa Lite (Resource Description Framework in attributes)\furl{http://www.w3.org/TR/rdfa-lite/}.
     412
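For illustration, a resource description embedded into a web page via RDFa Lite could look as follows (an illustrative sketch; \code{name}, \code{author} and \code{datePublished} are actual schema.org terms, the annotated content is arbitrary):

\lstset{language=HTML}
\begin{lstlisting}[label=lst:sampleschemaorg, caption=schema.org annotation in RDFa Lite (illustrative)]
<div vocab="http://schema.org/" typeof="Book">
  <span property="name">Language</span> by
  <span property="author">Leonard Bloomfield</span>
  (<span property="datePublished">1933</span>)
</div>
\end{lstlisting}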
    396413
    397414\section{Summary}
  • SMC4LRT/chapters/Definitions.tex

    r3671 r3680  
    5252dcterms: & http://purl.org/dc/terms \\
    5353oa: & http://www.w3.org/ns/oa\# \\
     54olac: & http://www.language-archives.org/OLAC/1.1/ \\
    5455ore: & http://www.openarchives.org/ore/terms/ \\
    5556cr: & http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/ \\
  • SMC4LRT/chapters/Design_SMCinstance.tex

    r3671 r3680  
    306306\end{enumerate}
    307307
     308This task is basically an application of the ontology mapping method.
     309
     310We do not try to achieve a complete ontology alignment, we just want to find
     311semantically equivalent concepts from other ontologies for our ``anonymous'' concepts.
     312This is essentially just another phrasing of the definition of the ontology mapping function as given by \cite{EhrigSure2004, amrouch2012survey}:
     313``for each concept (node) in ontology A [tries to] find a corresponding concept
     314(node), which has the same or similar semantics, in ontology B and vice versa''.
     315
     316The first two points in the above enumeration represent the steps necessary to be able to apply the ontology mapping.
     317The identification of appropriate vocabularies is discussed in the next subsection. In the operationalization, the identified vocabularies could be treated as one aggregated ontology to map all entities against. For the sake of higher precision, it may be sensible to perform the task separately for individual concepts, i.e. organisations, persons etc. and in every run consider only relevant vocabularies.
     318
     319
     320The transformation of the data has been partly described in the previous section:
     321it can be trivially and automatically converted into RDF triples as:
     322
     323\begin{example3}
     324<lr1> & cmd:Organisation & "MPI" \\
     325\end{example3}
     326
     327However, for the needs of the mapping task we propose to reduce and rewrite the data to retrieve distinct (concept, value) pairs:
     328
     329\begin{example3}
     330\_:1 & a & cmd:Organisation;\\
     331   & skos:altLabel & "MPI";
     332\end{example3}
     333
     334The \var{lookup} function is a customized version of the \var{map} function that operates on these information pairs (concept, label).
     335
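A possible formalization of \var{lookup} (a sketch following the notation of the \var{map} function in \ref{lit:schema-matching}; \var{C} denoting the set of concepts, \var{L} the set of labels and \var{voc(c)} the entities of the vocabularies relevant for concept \var{c} -- notation assumed here):

\begin{defcap}[!ht]
\caption{\var{lookup} function over (concept, label) pairs (sketch)}
\begin{align*}
& lookup \ : C \times L \rightarrow 2^{E} \\
& lookup(c,l) = \{ e \in voc(c) : sim(l, label(e)) \ > \ t \}
\end{align*}
\end{defcap}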
     336The two steps \var{lookup} and \var{assess} correspond exactly to the two steps of \cite{jimenez2012large} in their system \xne{LogMap2}: 1) computation of mapping candidates (maximize recall) and 2) assessment of the candidates (maximize precision).
     337
     338
    308339\begin{figure*}[!ht]
    309340\includegraphics[width=1\textwidth]{images/SMC_CMD2LOD}
     
    373404
    374405
     406
    375407%%%%%%%%%%%%%%%%%%%%%
    376408\section{SMC LOD - Semantic Web Application}
     
    378410
    379411
    380 
    381 \cite{Europeana RDF Store Report}
    382 
    383 Technical aspects (RDF-store?): Virtuoso
    384 
    385 \todocode{install Jena +  fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site}
    386 
    387 \todocode{install older python (2.5?) to be able to install dot2tex - transforming dot files to nicer pgf formatted graphs}\furl{http://dot2tex.googlecode.com/files/dot2tex-2.8.7.zip}\furl{file:/C:/Users/m/2kb/tex/dot2tex-2.8.7/}
    388 
    389 \todocode{check install siren}\furl{http://siren.sindice.com/}
    390 
    391 
    392 \todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/}
    393 
    394 \todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)}
    395 
    396  / interface (ontology browser?)
    397 
    398 semantic search component in the Linked Media Framework
    399 \todocode{!!! check install LMF - kiwi - SemanticSearch !!!}\furl{http://code.google.com/p/kiwi/wiki/SemanticSearch}
    400 
    401 \todoin{check SARQ}\furl{http://github.com/castagna/SARQ}
    402 
    403 
    404 \section {Full semantic search - concept-based + ontology-driven ?}
    405 \label{semantic-search}
    406 
    407412With the new enhanced dataset, as detailed in section \ref{sec:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset.
    408413
     
    415420rechercheisidore, dbpedia, ...
    416421
     422
     423Relevant experience with RDF stores is documented in the \xne{Europeana RDF Store Report}.
     424
     425One technical aspect is the choice of an RDF store, \xne{Virtuoso} being one candidate.
     426
     427
     428A semantic search component is provided in the Linked Media Framework.
     429
     430\todoin{check SARQ}\furl{http://github.com/castagna/SARQ}
     431
     432
     433%\section {Full semantic search - concept-based + ontology-driven ?}
     434%\label{semantic-search}
     435
     436
    417437\section{Summary}
     438
     439%The task can be also seen as building bridge between XML resources and semantic resources expressed in RDF, OWL.
     440
     441The process of expressing the whole of the data as one semantic resource can also be understood as a schema or ontology merging task, with data categories being the primary mapping elements.
     442
     443
    418444In this chapter, an expression of the whole of the CMD data domain into RDF was proposed, with special focus on the way how to translate the string values in metadata fields to corresponding semantic entities. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration.
    419445
  • SMC4LRT/chapters/Design_SMCschema.tex

    r3665 r3680  
    119119\caption{Attributes of \code{Term} when encoding CMD entity}
    120120\label{table:terms-attributes-cmd}
    121  \begin{tabularx}{1\textwidth}{ l | X | X }
     121\begin{tabularx}{1\textwidth}{ l | X | X }
     122 %\begin{tabu}{1\textwidth}{ l | l | l }
    122123  attribute & allowed values & sample value\\
    123124\hline
     
    689690Also such a visualization could feature direct search links from individual nodes into the dataset, i.e.  from a profile node a link could lead into a search interface listing metadata records of given profile.
    690691
     692
     693%%%%%%%%%%%%%%%%%%%%%%%%%
     694\section{Application of Schema Matching techniques in SMC}
     695\label{sec:schema-matching-app}
     696
     697Even though the described module is about ``semantic mapping'', until now we did not directly make use of the traditional ontology/schema mapping/alignment methods and tools as summarized in \ref{lit:schema-matching}. This is due
     698to the fact that in this work we can harness the mechanisms of the semantic interoperability layer built into the core of the CMD Infrastructure, which integrates the task of identifying semantic correspondences directly into the process of schema creation,
     699largely obviating the need for a posteriori complex schema matching/mapping techniques.
     700Put in terms of the schema matching methodology, the system relies on explicitly set concept equivalences as the basis for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.
     701
     702However, this only holds for schemas already created within the CMD framework (and even for these only to a certain degree, as will be explained later). Given the growing universe of definitions (data categories and components) in the CMD framework, the metadata modeller could very well profit from applying schema mapping techniques as a pre-processing step in the task of integrating existing external schemas into the infrastructure. (User involvement is identified by \cite{shvaiko2012ontology} as one of the promising future challenges to ontology matching.) Already now, we witness a growing proliferation of components in the Component Registry and of data categories in the Data Category Registry.
     703
     704Let us restate the problem of integrating existing external schemas as an application of the \var{schema matching} method:
     705The data modeller starts off with an existing schema \var{$S_{x}$}. The system accommodates a set of schemas\footnote{We talk of schemas even though the creation (and also remodelling) takes place in the Component Registry by creating CMD profiles and components, because every profile has an unambiguous expression in XML Schema.} \var{$S_{1..n}$}.
     706It is very improbable that there is a \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
     707Given the heterogeneity of the schemas present in the field of research, full alignments are not achievable at all.
     708However, thanks to the compositional nature of the CMD data model, the data modeller can reuse just parts of any of the schemas -- the
     709components \var{c}. Thus the task is to find for every entity $e_{x} \in S_{x}$ the set of semantically equivalent candidate components $\{c_{y}\}$, which corresponds to the definition of the mapping function for single entities as given in \cite{EhrigSure2004}.
     710Given that the modeller does not have to reuse the components as they are, but can use existing components as a base to create her own, she is helped even by candidates that are not equivalent. Thus we can further relax the task and allow candidates that are just similar to a certain degree, which can be operationalized as a threshold $t$ on the output of the \var{similarity} function.
     711Being only a pre-processing step meant to provide suggestions to the human modeller implies that recall is of higher importance than precision.
     712
     713Another requirement is that the matching entities should be maximal regarding the compositional tree:
     714
     715\begin{defcap}[!ht]
     716\caption{Maximality requirement on matching candidates}
     717\begin{align*}
     718& map(e_{x1})  \rightarrow c_{y1} \\
     719& map(e_{x2})  \rightarrow c_{y2} \\
     720& candidates := \{(e_{xa},c_{ya}) : \forall b \neq a : c_{ya} \notin descendants(c_{yb}) \}
     721\end{align*}
     722\end{defcap}
     723
     724Next to the usual features and measures that can be applied, like label equality, string similarity or structural equality,
     725the mapping function could be enriched with \emph{extensional} features based on the concept clusters as delivered by the crosswalk service (cf. \ref{sec:cx}).
     726
     727It would be worthwhile to test in how far the \var{smcIndex} paths as defined in \ref{def:smcIndex} could be used as a feature,
     728e.g. taking the longest matching subpath as a measure; see the sketch below.
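One conceivable operationalization of such a path-based feature (a sketch; \var{lcs} denoting the longest common subpath of two paths and \var{|p|} the length of a path -- notation assumed here):

\begin{defcap}[!ht]
\caption{path-based \var{similarity} measure over \var{smcIndex} paths (sketch)}
\begin{align*}
& sim_{path}(p_{1},p_{2}) = \frac{|lcs(p_{1},p_{2})|}{max(|p_{1}|,|p_{2}|)}
\end{align*}
\end{defcap}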
     729
     730
     731Although we exemplified on the case of integration of an external schema, the described approach could also be applied to the schemas already integrated in the system. Although there is already a high baseline thanks to the mechanisms of reuse of components and data categories, there certainly still exist semantic proximities that are not explicitly expressed by these mechanisms. This deficiency is rooted in the collaborative creation of the CMD components and profiles, where individual modellers overlooked, deliberately ignored or only partially reused existing components or profiles. This can be seen in the case of multiple teiHeader profiles, which, though they are modelling the same existing metadata format, are completely disconnected in terms of component and data category reuse (cf. \ref{results:tei}).
     732
     733Note that in the case of reuse of components, in the normal scenario, the semantic equivalence is ensured even though the new component (and all its subcomponents) is a copy of the old one with a new identity, because the references to data categories are copied as well; thus by default the new component shares all data categories with the original one, and the modeller has to deliberately change them if required. But even with reuse of components, scenarios are conceivable in which the semantic linking gets broken, or is not established, even though semantic equivalence prevails.
     734
     735The question is what to do with the new correspondences that would possibly be determined when, as proposed, we apply schema matching to the already integrated schemas. One possibility is to add a data category reference, if one element of the pair is still missing one.
     736However, if both are already linked to a data category, the pair of data categories could be added to the relation set in the Relation Registry (cf. \ref{def:rr}).
     737 
     738Once all the equivalences (and other relations) between the profiles/schemas are found, similarity ratios can be determined.
     739These new similarity ratios could be applied as alternative weights in the just-profiles graph (cf. \ref{sec:smc-cloud}).
     740
     741In contrast to the task described here, which -- restricted to matching XML schemas -- can be seen as staying in the ``XML world'',
     742another aspect of this work is clearly situated in the Semantic Web world and requires the application of ontology matching methods: the mapping of field values to semantic entities described in \ref{sec:values2entities}.
     743
     744
     745%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
     746
     747
    691748\section{Summary}
    692749In this core chapter, we laid out a design for a system dealing with concept-based crosswalks on the schema level.
    693750The system consists of three main parts: the crosswalk service, the query expansion module and \xne{SMC Browser} -- a tool for visualizing and exploring the schemas and the corresponding crosswalks.
    694 
     751In addition, we elaborated on the application of schema matching methods to infer mappings between schemas.
     752
  • SMC4LRT/chapters/Literature.tex

    r3671 r3680  
    4848Most recently, with \xne{Europeana Cloud}\furl{http://pro.europeana.eu/web/europeana-cloud} (2013 to 2015) a successor project to \xne{Europeana} was established: a Best Practice Network, coordinated by The European Library, designed to establish a cloud-based system for Europeana and its aggregators, providing new content, new metadata, a new linked storage system, new tools and services for researchers and a new platform -- Europeana Research.
    4949
    50 A number of catalogs and formats are further described in the section \ref{sec:other-md-catalogs}
    51 
    52 
    53 \section{Schema / Ontology Mapping/Matching}
    54 
    55 Schema or ontology matching provides the methodical foundation for the problem at hand the \emph{semantic mapping}.
    56 As Shvaiko\cite{shvaiko2012ontology} states ``a solution to the semantic heterogeneity problem. It finds correspondences between semantically related entities of ontologies.''
    57 
    58 One starting point for the plethora of work in the field of \emph{schema and ontology mapping} techniques and technology
    59 is the overview of the field by Kalfoglou \cite{Kalfoglou2003}.
    60 Shvaiko and Euzenat provide a summary of the key challenges\cite{Shvaiko2008} as well as a comprehensive survey of approaches for schema and ontology matching based on a proposed new classification of schema-based matching techniques\cite{Shvaiko2005_classification}.
    61 Noy \cite{Noy2005_ontologyalignment,Noy2004_semanticintegration}
    62 
    63 and more recently \cite{shvaiko2012ontology}(2012!) and \cite{amrouch2012survey} provide surveys of the methods and systems in the field.
    64 
    65 \paragraph{Methods}
    66 Semantic and extensional methods are still rarely
    67 employed by the matching systems. In fact, most of
    68 the approaches are quite often based only on
    69 terminological and structural methods
    70 
    71 classify, review, and experimentally compare major methods of element similarity measures and their combinations.\cite{Algergawy2010}
     50The related catalogs and formats are further described in section \ref{sec:other-md-catalogs}.
     51
     52
     53\section{Existing crosswalks (services)}
     54
     55Crosswalks, as lists of equivalent fields from two schemas, have been around for a long time in the world of enterprise systems (e.g. to bridge to legacy systems) and also in libraries, e.g. the \emph{MARC to Dublin Core Crosswalk}\furl{http://loc.gov/marc/marc2dc.html}.
     56
     57\cite{Day2002crosswalks} lists a number of mappings between metadata formats,
     58mostly within the Dublin Core and MARC families of formats,
     59e.g. the MARC to Dublin Core mapping\furl{http://www.loc.gov/marc/dccross.html}.
     60These are, however, static crosswalk repositories.
     66
     67
     68OCLC launched \xne{Metadata Schema Transformation Services}\furl{http://www.oclc.org/research/activities/schematrans.html?urlm=160118},
     69in particular the \xne{Crosswalk Web Service}\furl{http://www.oclc.org/developer/services/metadata-crosswalk-service}\furl{http://www.oclc.org/research/activities/xwalk.html}:
     71
     72\begin{quotation}
     73a self-contained crosswalk utility that can be called by any application that must translate metadata records. In our implementation, the translation logic is executed by a dedicated XML application called the Semantic Equivalence Expression Language, or Seel, a language specification and a corresponding interpreter that transcribes the information in a crosswalk into an executable format.
     74\end{quotation}
     75
     76The Crosswalk Web Service is now a production system that has been incorporated into a number of OCLC products and services.
     77
     78However, the demo service is not available anymore\furl{http://errol.oclc.org/schemaTrans.oclc.org.search}.
     79
     80
     81
     82The offered formats, however, concentrate on the formats relevant for the LIS community.
     84
     85For this service, a metadata format is defined as a triple of:
     86
     87\begin{itemize}
     88\item Standard -- the metadata standard of the record (e.g. MARC, DC, MODS, etc.)
     89\item Structure -- the structure of how the metadata is expressed in the record (e.g. XML, RDF, ISO 2709, etc.)
     90\item Encoding -- the character encoding of the metadata (e.g. MARC8, UTF-8, Windows 1251, etc.)
     91\end{itemize}
     92The Crosswalk Web Service offers an interface with 4 methods:
     93
     94\begin{itemize}
     95\item \code{translate(...)} -- translates the records
     96\item \code{getSupportedSourceRecordFormats()} -- returns a list of formats that are supported as input formats
     97\item \code{getSupportedTargetRecordFormats()} -- returns a list of formats that the input formats can be translated to
     98\item \code{getSupportedJavaEncodings()} -- returns the list of character encodings that Java supports (some formats support all of them)
     99\end{itemize}
     100
     101
     102\section{Schema/Ontology Mapping/Matching}
     103\label{lit:schema-matching}
     104
     105As Shvaiko\cite{shvaiko2012ontology} states ``\emph{Ontology matching} is a solution to the semantic heterogeneity problem. It finds correspondences between semantically related entities of ontologies.''
     106As such, it provides a very suitable methodical foundation for the problem at hand -- the \emph{semantic mapping}. (In sections \ref{sec:schema-matching-app} and \ref{sec:values2entities} we elaborate on the possible ways to apply these methods to the described problem.)
     107
     108There is a plethora of work on methods and technology in the field of \emph{schema and ontology matching} as witnessed by a sizable number of publications providing overviews, surveys and classifications of existing work \cite{Kalfoglou2003, Shvaiko2008, Noy2005_ontologyalignment, Noy2004_semanticintegration, Shvaiko2005_classification} and most recently \cite{shvaiko2012ontology, amrouch2012survey}.
     109
     110%Shvaiko and Euzenat provide a summary of the key challenges\cite{Shvaiko2008} as well as a comprehensive survey of approaches for schema and ontology matching based on a proposed new classification of schema-based matching techniques\cite{}.
     111
     112Shvaiko and Euzenat also run the web page \url{http://www.ontologymatching.org/} dedicated to this topic and the related OAEI\footnote{Ontology Alignment Evaluation Initiative -- \url{http://oaei.ontologymatching.org/}}, an ongoing effort to evaluate alignment tools based on various alignment tasks from different domains.
     113
     114Interestingly, \cite{shvaiko2012ontology} somewhat self-critically asks if after years of research ``the field of ontology matching [is] still making progress?''.
     115
     116\subsubsection{Method}
     117
     118There are slight differences in the use of the terms between \cite{EhrigSure2004, Ehrig2006}, \cite{Euzenat2007} and \cite{amrouch2012survey}; especially, one has to be aware whether in a given context the term denotes the task in general, the process, the actual operation/function or the result of the function.
     119
     120\cite{Euzenat2007} formalizes the problem as ``ontology matching operation'':
     121
     122\begin{quotation}
     123The matching operation determines an alignment A' for a
     124pair of ontologies O1 and O2. Hence, given a pair of
     125ontologies (which can be very simple and contain one entity
     126each), the matching task is that of finding an alignment
     127between these ontologies. [\dots]
     128\end{quotation}
     129
     130But basically the different authors broadly agree on the definition of \var{ontology alignment}: in the meaning of \concept{task} it is ``to identify relations between individual elements of multiple ontologies'', in the meaning of \concept{result} it is ``a set of correspondences between entities belonging to the matched ontologies''.
     131
     132More formally, \cite{Ehrig2006} formulates ontology alignment as ``a partial function based on the set \var{E} of all entities $e \in E$ and based on the set of possible ontologies \var{O}. [\dots] Once an alignment is established we say entity \var{e} is aligned with entity \var{f} when $align(e) = f$.'' Also, ``alignment is a one-to-one equality relation'' (although this is relativized further in the work, and also in \cite{EhrigSure2004}).
     133 
     134\begin{defcap}[!ht]
     135\caption{\var{align} function}
     136\begin{align*}
     137& align \ : E \times O \times O \rightarrow E
     138\end{align*}
     139\end{defcap}
     137
     138\cite{EhrigSure2004} and \cite{amrouch2012survey} instead introduce \var{ontology mapping} when applying the task to individual entities, in the meaning of a function that ``for each concept (node) in ontology A [tries to] find a corresponding concept
     139(node), which has the same or similar semantics, in ontology B and vice versa''. In the meaning of result it is a ``formal expression describing a semantic relationship between two (or more) concepts belonging to two (or more) different ontologies''.
     140
     141\cite{EhrigSure2004} further specify the mapping function as based on a similarity function, that for a pair of entities from two (or more) ontologies computes a ratio indicating the semantic proximity of the two entities.
     142
     143\begin{defcap}[!ht]
     144\caption{\var{map} function for single entities and underlying \var{similarity} function }
     145\begin{align*}
     146& map \ : O_{i1}  \rightarrow O_{i2} \\
     147& map( e_{i_{1}j_{1}}) = e_{i_{2}j_{2}}\text{, if } sim(e_{i_{1}j_{1}},e_{i_{2}j_{2}}) \ > \ t  \text{ with } t \text{ being the threshold} \\
     148& sim \ : E \times E \times O \times O \rightarrow [0,1]
     149\end{align*}
     150\end{defcap}
     151
     152This elegant abstraction introduced with the \var{similarity} function provides a general model that can accommodate a broad range of comparison relationships and corresponding similarity measures. And here, again, we encounter a broad range of possible approaches.
     153
     154\cite{ehrig2004qom} lists a number of basic features and corresponding similarity measures:
     155Starting from primitive data types, next to value equality, string similarity, edit distance or, in general, relative distance can be computed.
     156For concepts, next to the directly applicable unambiguous \code{sameAs} statements, label similarity can be determined (again either as string similarity, but also broadened by employing external taxonomies and other semantic resources like WordNet -- \emph{extensional} methods), as well as equal (shared) class instances, shared superclasses, subclasses and properties.
     157
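For labels, one common instantiation is a similarity normalized from the edit distance (a sketch; \var{ed} denoting the edit distance and \var{|s|} the length of a string -- notation assumed here):

\begin{defcap}[!ht]
\caption{string \var{similarity} based on edit distance (sketch)}
\begin{align*}
& sim_{string}(s_{1},s_{2}) = 1 - \frac{ed(s_{1},s_{2})}{max(|s_{1}|,|s_{2}|)}
\end{align*}
\end{defcap}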
     158\cite{Shvaiko2005_classification} distinguishes element-level (terminological) from structure-level (structural) matching techniques,
     159the latter based on background knowledge: subclass--superclass relationships, domains and ranges of properties, or analysis of the graph structure of the ontology.
     160For properties, the degree of equality of super- and subproperties and overlapping domains and/or ranges can be considered.
     165Additionally to these measures applicable on individual ontology items, there are approaches (like the \var{Similarity Flooding algorithm} \cite{melnik2002similarity}) to propagate computed similarities across the graph defined by relations between entities (primarily subsumption hierarchy).
     166
     167\cite{Algergawy2010} classifies, reviews, and experimentally compares major methods of element similarity measures and their combinations. \cite{shvaiko2012ontology}, comparing a number of recent systems, finds that ``semantic and extensional methods are still rarely employed. In fact, most of the approaches are quite often based only on terminological and structural methods.''
     168
     169\cite{Ehrig2006} employs this \var{similarity} function over single entities to derive the notion of \var{ontology similarity} as ``based on similarity of pairs of single entities from the different ontologies''. This is operationalized as some kind of aggregating function\cite{ehrig2004qom} that combines all similarity measures (mostly modulated by custom weighting) computed for pairs of single entities into one value (from the \var{[0,1]} range) expressing the similarity ratio of the two ontologies being compared. (The employment of weights allows applying machine learning approaches for optimization of the results.)
     170
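Such an aggregation could take the following shape (a sketch; $w_{k}$ denoting the weight assigned to the individual similarity measure $sim_{k}$ -- notation assumed here):

\begin{defcap}[!ht]
\caption{aggregated \var{similarity} with weights (sketch)}
\begin{align*}
& sim_{agg}(e,f) = \sum_{k} w_{k} \cdot sim_{k}(e,f) \text{, with } \sum_{k} w_{k} = 1 \\
& align(e) = f \text{, if } sim_{agg}(e,f) \ > \ t
\end{align*}
\end{defcap}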
     171Thus, \var{ontology similarity} is a much weaker assertion than \var{ontology alignment}; in fact, the computed similarity is interpreted to assert ontology alignment: an aggregated similarity above a defined threshold indicates an alignment.
     172
     173
     174As to the alignment process, \cite{Ehrig2006} distinguishes the following steps:
     175\begin{enumerate}
     176\item Feature Engineering
     177\item Search Step Selection
     178\item Similarity Assessment
     179\item Interpretation
     180\item Iteration
     181\end{enumerate}
     182
     183In contrast, \cite{jimenez2012large} in their system \xne{LogMap2} reduce the process to just two steps: computation of mapping candidates (maximise recall) and assessment of the candidates (maximize precision), which however correspond to steps 2 and 3 of the above procedure; in fact, the other steps are implicitly present in the described system.
     184
    72185
    73186\subsubsection{Systems}
    74 A number of existing systems for schema/ontology matching/alignment is mentioned in this overview publications:
    75 
    76 The majority of tools for ontology mapping use some sort of structural or
    77 definitional information to discover new mappings. This information includes
    78 such elements as subclass–superclass relationships, domains and ranges of
    79 properties, analysis of the graph structure of the ontology, and so on. Some of
    80 the tools in this category include
    81 
    82 IF-Map\cite{kalfoglou2003if}
    83 
    84 QOM\cite{ehrig2004qom},
    85 
    86 Similarity Flooding\cite{melnik}
    87 
    88 the Prompt tools \cite{Noy2003_theprompt} integrating with Protege
    89 
    90 
    91 \xne{COMA++} \cite{Aumueller2005}  composite approach to combine different match algorithms, user interaction via graphical interface , supports W3C XML Schema and OWL.
    92 
    93 \xne{FOAM}\cite{EhrigSure2005}
    94 
    95 
    96 Ontology matching system \xne{LogMap 2} \cite{jimenez2012large} supports user interaction and implements scalable reasoning and diagnosis algorithms, which minimise any logical inconsistencies introduced by the matching process.
    97 The process is divided into two main logical phases: computation of mapping candidates (maximise recall) and assessment of the candidates (maximize precision).
    98 
    99 
    100         s which are at the core of the mapping task.
    101 
    102 
    103 On the dedicated platform OAEI\footnote{Ontology Alignment Evalution Intiative - \url{http://oaei.ontologymatching.org/}} an ongoing effort is being carried out and documented comparing various alignment methods applied on different domains.
    104 
    105 
    106 One more specific recent inspirational work is that of Noah et. al \cite{Noah2010} developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching.
    107 
    108 Matching is laborious and error-prone process, and once ontology
    109 mappings are discovered, i
    110 
    111 \subsection{MOVEOUT: Application of Schema Matching on the CMD domain}
    112 Notice, that this the semantic interoperability layer built into the core of the CMD Infrastructure, integrates the
    113 task of identifying semantic correspondences directly into the process of schema creation,
    114 largely removing the need for complex schema matching/mapping techniques in the post-processing.
    115 However this is only holds for schemas already created within the CMD framework,
    116 Given the growing universe of definitions (data categories and components) in the CMD framework the metadata modeller could very well profit from applying schema mapping techniques as pre-processing step in the task of integrating existing external schemas into the infrastructure. User involvement is identified by \cite{shvaiko2012ontology} as one of promising future challenges to ontology matching
    117 
    118 Such a procedure pays tribute to the fact, that the mapping techniques are mostly error-prone and can deliver reliable 1:1 alignments only in trivial cases. This lies in the nature of the problem, given the heterogenity of the schemas present in the data collection, full alignments are not achievable at all, only parts of individual schemas actually semantically correspond.
    119 
    120 Once all the equivalencies (and other relations) between the profiles/schemas were found, simliarity ratios can be determined.
    121 
    122 The task can be also seen as building bridge between XML resources and semantic resources expressed in RDF, OWL.
    123 This speaks for a tool like COMA++ supporting both W3C standards:  XML Schema and OWL.
    124 Concentration on existing systems with user interface?
    125 
    126 The process of expressing the whole of the data as one semantic resource, can be also understood as schema or ontology merging task. Data categories being the primary mapping elements
    127 
    128 
    129 In the end
    130 It is also not the goal to merge
    131 
    132 Being only a pre-processing step meant to provide suggestions to the human modeller implies higher importance to recall than to precision.
    133 
    134 
    135 
    136 infrastructure un
    137 
    138 This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
    139 
    140 
    141 Application of ontology/schema matching/mapping techniques
    142 is reduced or outsourced
    143 
    144 
    145 \subsection{Existing Crosswalk services}
    146 
    147 
    148 \url{http://www.oclc.org/developer/services/metadata-crosswalk-service}
    149 
    150 
    151 VoID "Vocabulary of Interlinked Datasets") is an RDF based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID}
    152 
    153 http://www.dnb.de/rdf
    154 
    155 
    156 the entire WorldCat cataloging collection made publicly
    157 available using Schema.org mark-up with library extensions for use by developers and
    158 search partners such as Bing, Google, Yahoo! and Yandex
    159 
    160 OCLC begins adding linked data to WorldCat by appending
    161 Schema.org descriptive mark-up to WorldCat.org pages, thereby
    162 making OCLC member library data available for use by intelligent
    163 Web crawlers such as Google and Bing
     187A number of existing systems for schema/ontology matching/alignment are collected in the above-mentioned overview publications:
     188
     189IF-Map \cite{kalfoglou2003if}, QOM \cite{ehrig2004qom}, \xne{FOAM} \cite{EhrigSure2005}, Similarity Flooding (SF) \cite{melnik}, S-Match \cite{Giunchiglia2007_semanticmatching}, the Prompt tools \cite{Noy2003_theprompt} integrating with Protégé or \xne{COMA++} \cite{Aumueller2005}, \xne{Chimaera}. Additionally, \cite{shvaiko2012ontology} lists and evaluates some more recent contributions: \xne{SAMBO, Falcon, RiMOM, ASMOV, Anchor-Flood, AgreementMaker}.
     190
     191All of the tools use multiple methods as described in the previous section, exploiting both element-level and structural features and applying some kind of composition or aggregation of the computed atomic measures to arrive at an alignment assertion.
     192
     193Next to OWL as the input format supported by all the systems, some also accept XML Schemas (\xne{COMA++, SF, Cupid, SMatch});
     194some provide a GUI (\xne{COMA++, Chimaera, PROMPT, SAMBO, AgreementMaker}).
     195
     196Scalability is one factor to be considered, given that in a baseline scenario (before considering efficiency optimisations in candidate generation) the space of possible candidate mappings is the Cartesian product of the entities of the two ontologies being aligned. The authors of the (refurbished) ontology matching system \xne{LogMap 2} \cite{jimenez2012large} hold that it implements scalable reasoning and diagnosis algorithms, performant enough to be integrated with the provided user interaction.
     197
     198
    164199
    165200%%%%%%%%%%%%%%%%%%%%%%%%%%5
     
    168203The Linked Data paradigm\cite{TimBL2006} for publishing data on the web is increasingly being taken up by data providers across many disciplines \cite{bizer2009linked}. \cite{HeathBizer2011} gives a comprehensive overview of the principles of Linked Data with practical examples and current applications.
    169204
    170 \subsection{Semantic Web - Technical solutions / Server applications}
     205\subsubsection{Semantic Web - Technical solutions / Server applications}
     206
    171207
    172208The provision of the produced semantic resources on the web requires technical solutions to store the RDF triples, query them efficiently
     
    183219Another solution worth examining is the \xne{Linked Media Framework}\furl{http://code.google.com/p/lmf/} -- ``easy-to-setup server application that bundles together three Apache open source projects to offer some advanced services for linked media management'': publishing legacy data as linked data, semantic search by enriching data with content from the Linked Data Cloud, using SKOS thesaurus for information extraction.
    184220
     221One more specific work is that of Noah et al. \cite{Noah2010}, developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching.
     222
     223
    185224\begin{comment}
    186225LDpath\furl{http://code.google.com/p/ldpath/}
     
    192231\end{comment}
    193232
    194 \subsection{Ontology Visualization}
     233\subsubsection{Ontology Visualization}
    195234
    196235Landscape, Treemap, SOM
     
    199238
    200239
     240%%%%%%%%%%%%%%%%
    201241\section{Language and Ontologies}
    202242
     
    245285and Specification of Data Categories, Data Category Registry (ISO 12620:2009 \cite{ISO12620:2009})
    246286
    247 Lexical Markup Framework LMF \cite{Francopoulo2006LMF, ISO24613:2008} defines a metamodel for representing data in lexical databases used with monolingual and multilingual computer applications.
     287Lexical Markup Framework LMF \cite{Francopoulo2006LMF, ISO24613:2008} defines a metamodel for representing data in lexical databases used with monolingual and multilingual computer applications; reportedly it also provides an RDF serialization (to be verified).
    248288
    249289An overview of current developments in application of the linked data paradigm for linguistic data collections was given at the  workshop Linked Data in Linguistics\furl{http://ldl2012.lod2.eu/} 2012 \cite{ldl2012}.
  • SMC4LRT/chapters/Results.tex

    r3671 r3680  
    178178
    179179\subsubsection{teiHeader}
    180 
     180\label{results:tei}
    181181TEI is a de-facto standard for encoding any kind of textual resources. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description / metadata the complex element \code{teiHeader} is foreseen.
    182182TEI does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and the inner structure, allowing to generate custom schemas adapted to projects' needs (cf. \ref{def:tei}).
     
    281281%%%%%%%%%%%%%%%%%%%%%%%
    282282\subsection{SMC cloud}
     283\label{sec:smc-cloud}
    283284As a latest, still experimental, addition, SMC browser provides a special type of graph, that displays only profiles. The links between them reflect the reuse of components and data categories (i.e. how many components or data categories do the linked pairs of profiles share), indicating the degree of similarity or semantic proximity. Figure \ref{fig:SMC_cloud} depicts one possible output of the graph
    284285covering a large part of the defined profiles. It shows nicely the clusters of strongly related profiles in contrast to the greater distances between more loosely connected profiles.
  • SMC4LRT/utils.tex

    r3666 r3680  
    1919\usepackage{framed}
    2020\usepackage{amsmath}
    21 %\usepackage{tabularx}
     21\usepackage{tabularx}
    2222\usepackage{tabu}
    2323       